Method and apparatus for storage device management



Disclosed is a volume managing system for computer
storage devices. Physical volumes are logically
partitioned, with multiple copies of data being
maintained for system recovery purposes. A scheme for
monitoring, updating, and recovering data in the event
of data errors is achieved by maintaining a volume group
status area on each physical volume, the status area
reflecting status for all physical volumes defined for a
given volume group. Updates to this status area occur
serially, thereby protecting against all volumes
becoming corrupted at once. A method of updating
subsequent status changes, while the first status change
is still in progress, provides for improved system
throughput.

This invention relates in general to data processing
methods for use in data processing systems for managing
physical storage space on a storage device and in
particular to an improved method for maintaining
redundant data on these storage devices.

The prior art discloses a number of data processing
systems which employ disk storage devices for storing
data employed by the system. These devices store
various types of information such as the operating
system under which the microprocessor operates,
different application programs that are run by the
system and information that is created and manipulated
by the various application programs.

Disk storage devices have generally comprised one or
more magnetic or optical disks having a plurality of
concentric tracks which are divided into sectors or
blocks. Each surface of a disk generally stores
information and disk drives are configured with multiple
disks and multiple heads to permit one access mechanism
to position the heads to one of several concentric
recording tracks. Most current disk drives employ an
addressing convention that specifies a physical storage
location by the number of the cylinder (CC), the number
of the magnetic head (H) and the sector number (S). The
number of the cylinder is also the number of the tracks
where multiple heads are employed and the head number is
equivalent to the disk surface in a multi-disk
configuration. The "CCHS" addressing format is employed
independent of the capacity of the disk file since it is
capable of addressing any configuration that may exist.
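
The cylinder/head/sector convention can be related to a
flat sequence of blocks with a little arithmetic. The
sketch below is illustrative only: the text does not
define such a mapping, and the `Geometry` type, the
function names and the zero-based sector numbering are
assumptions introduced for the example.

```python
from typing import NamedTuple


class Geometry(NamedTuple):
    """Hypothetical drive geometry (not taken from the source text)."""
    heads: int               # one head per recording surface
    sectors_per_track: int


def chs_to_block(geom: Geometry, cylinder: int, head: int, sector: int) -> int:
    """Convert a cylinder/head/sector address to a linear block number.

    Assumes zero-based cylinder, head and sector numbers.
    """
    if cylinder < 0:
        raise ValueError("cylinder must be non-negative")
    if not 0 <= head < geom.heads:
        raise ValueError("head out of range")
    if not 0 <= sector < geom.sectors_per_track:
        raise ValueError("sector out of range")
    return (cylinder * geom.heads + head) * geom.sectors_per_track + sector


def block_to_chs(geom: Geometry, block: int) -> tuple:
    """Inverse of chs_to_block: linear block number back to (C, H, S)."""
    track, sector = divmod(block, geom.sectors_per_track)
    cylinder, head = divmod(track, geom.heads)
    return cylinder, head, sector
```

With a hypothetical geometry of 4 heads and 17 sectors
per track, `chs_to_block(Geometry(4, 17), 2, 1, 5)` gives
block 158, and `block_to_chs` maps 158 back to (2, 1, 5).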

The capacity of disk storage devices measured in terms
of bytes is dependent on the recording technology
employed, the track density, disk size and the number of
disks. As a result disk drives are manufactured in
various capacities, data rates and access times.

Most data processing systems generally employ 
a number of disk drives for storing data. Since each 
device is a failure independent unit, it is sometimes 
advantageous to spread the data to be stored over a 
number of smaller capacity drives rather than having 
one large capacity device. This configuration permits 
a copy of critical data to be stored in a separate device 
which can be accessed if the primary copy is not avail- 
able. 

A concept known as mirroring can be used to support
replication and recovery from media failures, as is
described in Mirroring of Data on a partition basis
(31514), Research Disclosure # 315, July 1990, wherein
data is mirrored or replicated on a partition
(physically contiguous collection of bytes) basis. This
provides even greater flexibility in backup and recovery
of critical data because of its finer granularity in
defining storage areas to be duplicated.

The task of allocating disk storage space in the system
is generally the responsibility of the operating system.
Unix (Trademark of UNIX System Laboratories, Inc.) type
operating systems such as the IBM AIX (Trademark of IBM)
operating system which is employed on the IBM RISC
System/6000 (Trademark of IBM) engineering workstation
have a highly developed system for organizing files. In
Unix parlance a "file" is the basic structure that is
used for storing information that is employed in the
system. For example a file may be a directory which is
merely a listing of other files in the system, or a data
file. Each file must have a unique identifier. A user
assigns a name to a file and the operating system
assigns an inode number and a table is kept to translate
names to numbers. A file name is merely a sequence of
characters. Files may be organized by assigning related
files to the same directory, which characteristically is
another file with a name and which merely lists the name
and inode number of the files stored in that directory.

The AIX operating system also organizes file directories
in groups which are given a file name since they are
also considered to be a file. The resultant organization
is known as a hierarchical file system which resembles
an inverted tree structure with the root directory at
the top and a multi-level branching structure descending
from the root. Both directories and non-directory type
files can be stored at each level. Files that are listed
by name in a directory at one level are located at the
next lower level. A file is identified in the
hierarchical file system by specifying its name preceded
by the description of the path that is traced from the
root level to the named file. The path descriptor is in
terms of the directory names through which the path
descends. If the current directory is the root directory
the full path is expressed. If the current directory is
some intermediate directory, the path description may be
shortened to define the shorter path.
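
The name-to-inode translation and path descent described
above can be sketched as follows. This is a toy model,
not the AIX implementation: the `INODES` table, its
inode numbers and the `resolve` function are all
hypothetical, with a directory represented simply as a
mapping from names to inode numbers.

```python
ROOT_INODE = 2  # hypothetical inode number chosen for the root directory

# Hypothetical inode table: a dict entry is a directory (name -> inode
# number); a bytes entry stands in for the contents of a data file.
INODES = {
    2: {"bin": 3, "tmp": 4},   # root directory
    3: {"ls": 5},              # /bin
    4: {},                     # /tmp (empty)
    5: b"utility contents",    # /bin/ls
}


def resolve(path, inodes=INODES, cwd=ROOT_INODE):
    """Return the inode number named by `path`.

    A path starting with "/" is traced from the root; any other path is
    a shortened description traced from the current directory `cwd`.
    """
    current = ROOT_INODE if path.startswith("/") else cwd
    for name in path.split("/"):
        if not name:               # skip empty components such as the leading "/"
            continue
        entry = inodes[current]
        if not isinstance(entry, dict):
            raise NotADirectoryError(name)
        if name not in entry:
            raise FileNotFoundError(name)
        current = entry[name]
    return current
```

In this model the full path `/bin/ls` and the shortened
path `ls` (with `/bin` as the current directory) resolve
to the same inode.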

The various files of the operating system are themselves
organized in a hierarchical file system. For example a
number of subdirectories descend from the root directory
and list files that are related. The subdirectories have
names such as / which stores the AIX kernel files; /bin
which stores the AIX utilities; /tmp which stores
temporary files; and /u which stores the users' files.

As indicated previously the task of assigning AIX files
to specific addressable storage units on the disk drive
is the responsibility of the operating system. Prior to
actually assigning a file to disk blocks, a
determination is made to divide the available disk
storage space of the storage subsystem into a number of
different areas so each area can store files having the
same general function. These assigned areas are often
referred to as virtual disks or logical volumes. The
term mini-disk is used in the IBM RT system and the term
A-disk in IBM's VM system. The term logical volume is
used on IBM's AIX system.

Several advantages are obtained from the standpoint of
management and control when files having the same
characteristics are stored in one defined area of the
disk drive. For example, a certain group of files may
not be changed at all over a certain period of time
while others may change quite rapidly so that they would
be backed up at different times. It is also simpler for
the administrator to assign these files to a virtual
disk or logical volume in accordance with their function
and manage all the files in one group the same. These
are just two examples of many where the provision of
virtual disks/logical volumes simplifies the
administration and control by the operating system of
the storage of files in the storage subsystem.

Conventional methods for protecting the data integrity
in data processing systems are not efficient in a
logical volume environment. What an end user perceives
to be a single volume of data could actually have data
spread across numerous physical volumes. US-A-4,507,751
describes a conventional log write ahead method used in
a conventional database.

Methods for extending data integrity to a virtual disk
system are shown in US-A-4,498,145, US-A-4,945,474 and
US-A-4,930,128. However these methods introduce
considerable overhead in maintaining error logs and
recovery procedures. These systems are further limited
in that a fault or error in the data merely results in
the data being restored to an older version of the data,
thereby losing the updated data.

Other methods provide data redundancy, where an old and
new copy of data is maintained. Once the new copy of
data is verified to be valid, it becomes the old copy
and what was once the old copy can now be overwritten
with new data. Thus, the old and new copies ping-pong
back and forth in their roles of having old or new data.
As the number of physical data volumes is increased
under this method, the overhead of maintaining this
technique severely impacts system performance and
throughput.

It is thus desirable to provide for a data processing
system which has a virtual disk/logical volume data
system with data redundancy for error recovery that has
a minimal impact on system performance.

Accordingly, the present invention provides a method for
managing a plurality of data storage devices associated
with a computer system and having a first physical
volume and subsequent physical volumes and being
partitioned into one or more logical volumes, each of
said logical volumes being further partitioned into one
or more logical partitions each of which comprises one
or more physical partitions of said storage devices,
said method comprising the steps of:

determining status information for each of said physical
partitions and recording said status information in a
memory of said computer system;

recording said status information in a status area
existing on each of said data storage devices;

creating updated status information when a write request
is generated for any of said physical partitions;

updating said status area on said first physical volume
with said updated status information; and

updating said status area of each subsequent physical
volume within said storage devices in succession with
said updated status information, wherein if a second or
subsequent write request is received prior to completing
an update of each of said storage device status areas as
a result of a prior write request, said status
information is updated in said computer memory and used
in updating said next succeeding physical volume status
area.

The present invention also provides a computer system
including means for managing a plurality of data storage
devices associated with said computer system and having
a first physical volume and subsequent physical volumes
and being partitioned into one or more logical volumes,
each of said logical volumes being further partitioned
into one or more logical partitions each of which
comprises one or more physical partitions of said data
storage devices, said managing means comprising:

means for maintaining status information for each of
said physical partitions in a memory of said computer
system;

recording means for recording said status information in
a status area existing on each of said data storage
devices;

means for creating updated status information when a
write request is generated for any of said physical
partitions;

first update means for updating said status area on said
first physical volume with said updated status
information; and

subsequent update means for updating said status area of
each subsequent physical volume within said data storage
devices in succession with said updated status
information, wherein if a second or subsequent write
request is received prior to completing an update of
each of said data storage device status areas as a
result of a prior write request, said status information
is updated in said computer memory and used in updating
said next succeeding physical volume status area.

The present invention is directed to the aforementioned
performance problems which are introduced in a system
which maintains multiple copies of data. In accordance
with the new data processing method, a physical
partition comprising a plurality of physically
contiguous disk blocks or sectors is established as the
basic unit of space allocation, while the disk block is
kept as the basic unit of addressability of the disk
file. A plurality of physical partitions are grouped
together and called a physical volume. A plurality of
physical volumes that are grouped together is referred
to as a volume group. The number of physical blocks
contained in each physical partition and the number of
physical partitions in each physical volume are fixed
when the physical volume is installed into the volume
group. Stated differently, all physical partitions in a
physical volume group are the same size. Different
volume groups may have different partition sizes.

When an AIX file system, i.e., a group of related files,
is to be installed on the system, a logical volume is
created which includes only the minimum number of
physical partitions on the disk required to store the
file system. As more storage space is needed by the file
system, the logical volume manager allocates an
additional physical partition to the logical volume. The
individual physical partitions of the logical volume may
be on different disk drives.

A partition map is maintained by the logical volume
manager which specifies the physical address of the
beginning of each physical partition in terms of its
device address and block number on the device, to assist
in correlating logical addresses provided by the system
to real addresses on the disk file.
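
The role of the partition map can be illustrated with a
short sketch. It assumes an unmirrored logical volume
whose partitions are listed in logical order; the
`PartitionEntry` type, the `translate` function and the
device names in the usage note are hypothetical and are
not taken from the logical volume manager itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartitionEntry:
    """One partition map entry: where a physical partition begins."""
    device: str        # device address (hypothetical naming)
    start_block: int   # block number of the partition's first block


def translate(partition_map, blocks_per_partition, logical_block):
    """Map a logical block number to a (device, physical block) pair.

    `partition_map` lists the logical volume's partitions in logical
    order; every partition holds `blocks_per_partition` blocks.
    """
    if logical_block < 0:
        raise IndexError("negative logical block")
    index, offset = divmod(logical_block, blocks_per_partition)
    if index >= len(partition_map):
        raise IndexError("logical block beyond end of logical volume")
    entry = partition_map[index]
    return entry.device, entry.start_block + offset
```

For a two-partition volume with entries
`PartitionEntry("hdisk0", 1000)` and
`PartitionEntry("hdisk1", 5000)` and 128 blocks per
partition, logical block 130 falls 2 blocks into the
second partition and so translates to block 5002 on
`hdisk1`.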

Data being stored within the system can be mirrored,
where redundant copies of the data are stored in
separate physical partitions. Mirroring is achieved by
adding an additional structuring mechanism between the
logical volume and the physical partitions therewithin.
The logical volume is instead made up of logical
partitions, which function identically to the physical
partitions previously discussed. These logical
partitions are then made up of one or more physical
partitions. When more than one physical partition is
associated with a logical partition, the logical
partition is said to be mirrored. When a logical
partition is mirrored then a request to read from a
logical partition may read from any physical partition.
These multiple physical partitions are where the
redundant copies of data are stored. Thus, a logical
partition's data can be mirrored on any number of
physical partitions associated therewith.

When a write request for a logical volume is received,
data on all physical copies of the logical partition,
i.e. all copies of the physical partitions, must be
written before the write request can be returned to the
caller or requestor. Whenever data on a logical volume
is updated or written to, it is possible through system
malfunctions or physical volume unavailability that any
particular physical copy will have a write failure. This
failure causes the data in this particular physical copy
to be incorrect, and out of synchronization with other
copies of the same data. When this occurs, this physical
copy is said to be stale and cannot be used to satisfy a
subsequent read request.
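
The write-all, read-any behaviour and the notion of a
stale copy can be sketched as below. This is a
simplified in-memory model under stated assumptions: the
`MirroredPartition` class and the `failing` parameter
(used to simulate write failures) are hypothetical, and
resynchronizing a stale copy is not modelled, so a copy
stays stale once marked.

```python
class MirroredPartition:
    """Toy model of a logical partition backed by several physical copies."""

    def __init__(self, copy_names):
        # Data currently held by each physical copy (None = never written).
        self.copies = {name: None for name in copy_names}
        self.stale = set()   # names of copies that missed a write

    def write(self, data, failing=()):
        """Write `data` to every copy before returning to the caller.

        `failing` names the copies whose write is simulated to fail;
        those copies keep their old data and are marked stale.
        """
        for name in self.copies:
            if name in failing:
                self.stale.add(name)
            else:
                self.copies[name] = data

    def read(self):
        """Read from any copy that is not stale; return (copy name, data)."""
        for name, data in self.copies.items():
            if name not in self.stale:
                return name, data
        raise IOError("no non-stale copy available")
```

After a write that fails on one copy, that copy still
holds the previous data, which is why it must be
excluded from subsequent reads.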

Status information regarding stale data must be stored
on permanent storage so that the information is
maintained through system crashes/reboots or power
interruption. This stale information is stored in a
Status Area (VGSA) that is written to all active
physical volumes in the volume group. With mirroring,
this volume group is a collection of physical volumes
where the physical partitions may be allocated to make
up the logical partitions of the logical volume. Each
physical volume contains a copy of the VGSA so that any
physical volume can be used to determine the state of
any physical partition allocated to any logical volume
in the volume group. A volume group may contain many
physical volumes, and a change in the state of any
physical partition will result in the updating of the
VGSAs on each physical volume. In the present preferred
embodiment, there is a limit of 32 physical volumes per
volume group.

When a physical partition goes stale due to a write
request that has some failure, the originating request
must wait for all VGSAs to be updated with the new state
information before being allowed to be returned to the
caller. If the first request is actively updating VGSAs
and a second request requires a VGSA update, it must
wait until the first is completed, causing degradation
in system performance. For example, in a worst case
scenario, where the second request immediately followed
the first request, the first request would take N x Q
time (where N is the number of VGSAs to update and Q is
the time required per VGSA) and the second request would
similarly take N x Q time, resulting in a delay of
2N x Q for the second request to return to the
originator.

One possible solution is to write the VGSAs in parallel.
However, this allows for the possibility of losing a
majority of the VGSAs due to a catastrophic system
failure, such as a power outage in the middle of the
write, which could potentially corrupt all VGSAs and
therefore lose the stale status information for all the
physical partitions. Therefore, the VGSAs must be
written serially to prevent this potential loss.

The present invention addresses this problem of system
degradation, when updating multiple VGSAs serially, by
using a concept hereinafter called the Wheel. The Wheel
maintains and updates the VGSAs on all physical volumes
in the volume group for a given request. The Wheel
accepts requests, modifies a memory version of the VGSA
as per that request, initiates the VGSA writes, and when
all VGSAs have been updated for that request finally
returns that request to its originator. The Wheel also
ensures that the request will not be held up longer than
the time it takes to write N + 1 VGSAs (again, where N
is the number of VGSAs and physical volumes in the
volume group), as opposed to other methods which could
take as long as the time it takes to write 2N VGSAs.
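
The difference between the 2N and N + 1 bounds can be
illustrated with a small timing model. The sketch below
is not the code of Fig. 21; it is a simplified
simulation under stated assumptions: every VGSA write
takes the same time, arrival times are given in
ascending order, each write carries whatever the memory
version of the VGSA contains when that write starts, and
a request completes once N consecutive serial writes
have carried its change (on a rotating wheel, N
consecutive writes cover all N physical volumes). Both
function names are hypothetical.

```python
def serial_completion_times(n_volumes, arrivals, write_time=1.0):
    """Baseline: a request starts its N writes only after the prior one ends."""
    done, busy_until = [], 0.0
    for arrival in arrivals:
        busy_until = max(busy_until, arrival) + n_volumes * write_time
        done.append(busy_until)
    return done


def wheel_completion_times(n_volumes, arrivals, write_time=1.0):
    """Wheel model: later requests join the rotation already in progress."""
    done = [None] * len(arrivals)
    remaining = {}       # request index -> VGSA writes still needed
    now, nxt = 0.0, 0
    while nxt < len(arrivals) or remaining:
        if not remaining:
            # Wheel is idle: it starts turning when the next request arrives.
            now = max(now, arrivals[nxt])
        # Requests that arrived before this write starts update the
        # memory version, so this write and the next N - 1 carry them.
        while nxt < len(arrivals) and arrivals[nxt] <= now:
            remaining[nxt] = n_volumes
            nxt += 1
        now += write_time            # one serial VGSA write completes
        for i in list(remaining):
            remaining[i] -= 1
            if remaining[i] == 0:    # all N VGSAs now reflect request i
                done[i] = now
                del remaining[i]
    return done
```

With N = 4 and requests arriving at times 0.0 and 0.5,
the baseline completes them at 4.0 and 8.0, while the
wheel model completes them at 4.0 and 5.0: the second
request waits for the write already in flight plus four
more, which is within the N + 1 bound.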

In order that the invention will be fully understood,
preferred embodiments thereof will now be described, by
way of example only, with reference to the accompanying
drawings in which:

Fig. 1 is a functional block diagram of a data
processing system in which the method of the present
invention may be advantageously employed;
Fig. 2 is a diagrammatic illustration of the
hierarchical file system organization of the files
containing the information to be stored on the system
shown in Fig. 1;
Fig. 3 is a diagrammatic illustration of a disk file
storage device shown functionally in Fig. 1;
Fig. 4 is a diagram illustrating the physical
relationships of various physical storage components
employed in the real addressing architecture of a disk
file;
Fig. 5 illustrates the general layout of a Physical
Volume;
Fig. 6 illustrates the general layout of the Logical
Volume Manager Area;
Fig. 7 illustrates the details of the Volume Group
Status Area Structure;
Fig. 8 illustrates the details of the Volume Group Data
Area Structure;
Fig. 9 illustrates the layout of a Logical Volume;
(There is no Fig. 10, 11 or Fig. 12.)
Fig. 13 illustrates the system relationship to a logical
volume manager pseudo-device driver;
Fig. 14 illustrates the interrelationship between
logical volumes, logical partitions, and physical
partitions;
Fig. 15 illustrates the interrelationship between
physical partitions, physical volumes, and volume
groups;
Fig. 16 illustrates the Volume Group Status Area WHEEL
concept;
Fig. 17 illustrates the PBUF data structure;
Figs. 17a and 17b illustrate the PBUF data structure
elements;
Fig. 18 illustrates the Logical Volume Device Driver
Scheduler initial request policy;
Fig. 19 illustrates the Logical Volume Device Driver
Scheduler post request policy;
Fig. 20 illustrates the Logical Volume Device Driver
Volume Group Status Area processing; and
Fig. 21 illustrates the code used to implement the WHEEL
function. The code has been written in the C programming
language.

Fig. 1 illustrates functionally a typical data
processing system 10 which embodies the method of the
present invention for managing storage space. As shown
in Fig. 1, the system hardware 10 comprises a
microprocessor 12, a memory manager unit 13, a main
system memory 14, an I/O channel controller 16 and an
I/O bus 21. A number of different functional I/O units
are shown connected to bus 21 including the disk drive
17. The information that is stored in the system is
shown functionally by block 11 in Fig. 1 and comprises
generally a number of application programs 22 and the
operating system kernel 24, which in this instance may
be assumed to be the AIX operating system. Also shown is
a group of application development programs 23 which may
be tools used by program development personnel during
the process of developing other programs.

An example of a commercial system represented by Fig. 1
is the IBM RISC System/6000 engineering workstation
which employs the AIX operating system. The AIX
operating system is a Unix type operating system and
employs many of its features including system calls and
file organization.

Fig. 2 illustrates the file organization structure of
the AIX operating system. The basic unit of information
stored is termed a "file." Each file has a name such as
"my_file.001". Files may be grouped together and a list
generated of all file names in the group. The list is
called a directory and is per se a file, with a name
such as "my_direct.010". The organization shown in
Fig. 2 is called an inverted tree structure since the
root of the file organization is at the top. The root
level of the organization may contain directory files
and other type files. As shown in Fig. 2, a root
directory file lists the names of other files 00A, 00B,
00C, 00D, and 00E. The files listed in a directory file
at one level appear as files at the next lower level.
The file name includes a user assigned name and a path
definition. The path definition begins at the root
directory which, by convention, is specified by a "slash
character" (/) followed by the file name or the
directory name that is in the path that must be traced
to reach the named file.

Each of the program areas shown in block 11 in Fig. 1
includes a large number of individual files which are
organized in the manner shown in Fig. 2. The term "File
System" is used to identify a group of files that share
a common multi-level path or a portion of their
respective multi-level paths.

The method of the present invention functions to manage
storage space on the disk drive 17 shown in Fig. 1 for
all of the files represented in block 11 of Fig. 1 and
the files that would be represented on the hierarchical
storage system shown in Fig. 2.

The disk drive 17 in practice may comprise a plurality
of individual disk drives. One such device is shown
diagrammatically in Fig. 3. The device as shown in
Fig. 3 comprises a plurality of circular magnetic disks
30 which are mounted on a shaft 31 which is rotated at a
constant speed by motor 32. Each surface 33 and 34 of
the disk 30 is coated with magnetic material and has a
plurality of concentric magnetic tracks. Other
embodiments would have disk 30 coated with material to
allow optical storage of data.

The disk drive 17 further includes a mechanism 35 for
positioning a plurality of transducers 36, one of each
being associated with one surface, conjointly to one of
the concentric recording track positions in response to
an address signal 36 supplied to actuator 37 attached to
move carriage 38. One recording track on each surface of
each disk belongs to an imaginary cylinder of recording
tracks that exist at each track position.

The physical address to the disk drive takes the form of
a five byte address designated "CCHS" where CC
represents the cylinder or track number, H represents
the number assigned to the magnetic head or transducer
which also corresponds to the disk surface since there
is one head per surface, and S represents the sector or
block number of a portion of the track. The block is
established as the smallest unit of data that can be
addressed on the device. Other embodiments could support
other physical head to disk configurations and still be
within the scope of this invention. For example, instead
of a single head or transducer corresponding to each
disk surface, multiple heads or transducers might be
utilized to reduce the seek time required to attain a
desired track location.

From a programming standpoint a disk drive is 
sometimes referred to as a Physical Volume (PV) and 
is viewed as a sequence of disk blocks. A Physical 
Volume has one device address and cannot include 
two separate disk devices since each device has a 
separate accessing mechanism and requires a 
unique address. 

Fig. 4 illustrates the physical relationship of the
various storage elements involved in the addressing
architecture of a disk drive which to a large extent is
generally standardized in the industry.

Each byte position 40 stores one byte of data. The
sector or block 41 comprises a specified plurality of
sequential or contiguous byte positions, generally 512,
and is the lowest level of an addressable element.
Sectors or blocks 41 are combined into tracks 42, which
are combined into surfaces 33 and 34, which are combined
into disks 31, 32, ..., which are combined into disk
drives or disk storage devices 17 of Fig. 1. If more
than one disk storage device 17 is employed the
combination of two or more devices is referred to as a
physical string of disk drives or disk files. In
practice a disk or a disk track 42 may contain one or
more sectors 17 having a number of defects sufficient to
render the block unusable.

The layout of a Physical Volume is shown in Fig. 5. Each
physical volume, for example each separate disk drive,
reserves an area of the volume for storing information
that is used by the system when the power is first
turned on. This is now a standard convention in the
industry where, for example, tracks or cylinders 0-4 are
reserved for special information.

Each physical volume reserves at least two cylinders for
special use. The Boot Code, which may be used to load
diagnostics software, or the kernel of the Operating
System, is held in a normal logical volume and no longer
requires a special physical volume location.

The first reserved cylinder is cylinder 0, the first
cylinder on any physical volume. Each physical volume
uses the first four tracks of cylinder 0 to store
various types of configuration and operation information
about the Direct Access Storage Devices (DASD) that are
attached to the system. Some of this information is
placed on the cylinder by the physical volume
manufacturer, and some of it is written by the operating
system on the first 4 tracks of cylinder 0.

The second reserved cylinder on the physical volume is
for the exclusive use of the Customer Engineer and is
called the CE cylinder. This is always the last cylinder
on the physical volume and is used for diagnostic
purposes. The CE cylinder cannot be used for user data.
The Boot Code area and the Non-Reserved area are pointed
to by the contents of an IPL Record interpreted in the
context of the contents of a Configuration Record.

The Initial Program Load (IPL) Record consisting of one
block contains information that allows the system to
read the Boot Code (if any) and initialize the physical
volume. The IPL Record can be divided into four logical
sections: The first section is the IPL Record ID. The
second section contains format information about the
physical volume. The third section contains information
about where the Boot Code (if any) is located and its
length. The fourth section contains information about
where the non-reserved area of the physical volume is
located and its length.

One track is also reserved for the Power On System Test
(POST) control block that is created in memory during
system initialization.

The first part of the non-reserved area of a physical
volume contains a Logical Volume Manager Area. The
invention hereinafter disclosed is primarily concerned
with the management of this Logical Volume Manager Area.
Fig. 6 is an exploded view of the Logical Volume Manager
Area, which has a Volume Group Status Area and Volume
Group Data Area. Secondary copies of these areas may
also immediately follow the primary copies. To save
space on the physical volumes, the size of this Logical
Volume Manager Area is variable. It is dependent on the
size of the physical volume and the number of logical
volumes allowed in the volume group.

As previously mentioned, each physical volume contains a
Volume Group Status Area (VGSA). The Status Area
indicates the state of each physical partition on the
physical volume. Every physical volume within a volume
group contains an identical copy of the Status Area. The
Status Area can be duplicated on the same physical
volume, is not contained within any physical partition,
and has the format shown in Fig. 7. The Status Area
should be allocated on DASD in such a way as to reduce
the probability of a single failure obliterating both
copies of it.

The details of the Status Area are shown in Fig. 7. The
various fields within the Status Area are interpreted as
follows:

BEGINNING_TIMESTAMP and ENDING_TIMESTAMP are used when
the VG (volume group) is varied on to validate the VGSA
and control the recovery of the most recent VGSA. Each
timestamp is 8 bytes long. The recovery and validation
process will be discussed later.
PARTITION_STATE_FLAGS occupy the remainder of the VGSA.
The flags are evenly divided among the maximum of 32 PVs
(physical volumes) in a VG. That means each PV has 127
bytes of state flags. That leaves 24 bytes of the 4096
in each VGSA unused. It also limits the number of PPs
(physical partitions) on any PV to 127 x 8 or 1016
partitions. This should not restrict the use of any
portion of any disk since the size of the partitions is
not a factor, only the quantity.
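
The arithmetic behind the state flags can be made
concrete with a short sketch. Only the figures of 32
PVs, 127 flag bytes per PV and one bit per partition
come from the text; the ordering of the flags by PV, the
least-significant-bit-first numbering, the convention
that a set bit means "stale", and all function names are
assumptions made for illustration and do not describe
the actual layout of Fig. 7.

```python
MAX_PVS = 32                 # physical volumes per volume group (from the text)
FLAG_BYTES_PER_PV = 127      # state flag bytes per physical volume (from the text)
MAX_PPS_PER_PV = FLAG_BYTES_PER_PV * 8   # one bit per partition: 1016


def flag_position(pv_index, pp_index):
    """Return (byte offset in the flags region, bit mask) for a partition.

    Assumed layout: flags grouped by PV, bits numbered from the least
    significant bit of each byte.
    """
    if not 0 <= pv_index < MAX_PVS:
        raise IndexError("physical volume index out of range")
    if not 0 <= pp_index < MAX_PPS_PER_PV:
        raise IndexError("physical partition index out of range")
    byte_in_pv, bit = divmod(pp_index, 8)
    return pv_index * FLAG_BYTES_PER_PV + byte_in_pv, 1 << bit


def mark_stale(flags, pv_index, pp_index):
    """Set the (assumed) stale bit for one physical partition."""
    offset, mask = flag_position(pv_index, pp_index)
    flags[offset] |= mask


def is_stale(flags, pv_index, pp_index):
    """Test the (assumed) stale bit for one physical partition."""
    offset, mask = flag_position(pv_index, pp_index)
    return bool(flags[offset] & mask)


def new_flags():
    """Memory version of the flags region: 32 x 127 = 4064 zeroed bytes."""
    return bytearray(MAX_PVS * FLAG_BYTES_PER_PV)
```

Under these assumptions, partition 9 of PV 1 lives at
byte offset 128 of the flags region with bit mask 2.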

As previously mentioned, each physical volume contains a
Volume Group Data Area (VGDA). The Data Area indicates
the interrelationship between the logical and physical
volumes and physical partitions. The Data Area can be
duplicated on the same physical volume, is not contained
within any physical partition, and has the format shown
in Fig. 7. The Data Area should be allocated on DASD in
such a way as to reduce the probability of a single
failure obliterating both copies of it. The details of
the Data Area are shown in Fig. 8. The various fields
within the Data Area are described at pages 8 and 9 of
Fig. 21. The VGDA is a variable sized object that is
user defined when the Volume Group is created.

Referring again to Fig. 5, a User Area follows the 
Logical Volume Manager Area, and contains the nor- 
mal user data area. 

A bad block pool area in Fig. 5 is also provided which
supplies substitution blocks for user area blocks that
have been diagnosed as unusable. It will be assumed in
the remaining description that there are no bad blocks
on the disk or if there are they are handled by any of
the well known prior art techniques.

Fig. 9 indicates the layout of a logical volume 
where block numbers are decimal. Logical partitk>n 
size shown is 64 Kilobytes (128 logical blocks). 

In the preferred embodiment, the method of the
present invention is implemented by a file
named /dev/ium which is called the Logical Volume
Manager.

The Logical Volume Manager (LVM) provides the 
ability to create, modify and query logical volumes, 
physical volumes and volume groups. The LVM auto- 
matically expands logical volumes to the minimum 
size specified, dynamically as more space is needed. 
Logical volumes can span physical volumes in the 
same volume group and can be mirrored for high 
reliability, availability, and performance. Logical 
volumes, volume groups and physical volumes all 
have IDs that uniquely identify them from any other 
device of their type on any system. 

The LVM comprises a number of operations performed
by calls to the SYSCONFIG system call.
These SYSCONFIG calls include the processes for
creating and maintaining internal data structures having
volume status information contained therein.
These system calls are more fully described in the
IBM manual "AIX Version 3 for RISC System/6000,
Calls and Subroutines" Reference Manual: Base
Operating System, Vol. 2.

A Logical Volume Manager pseudo device driver
64 is shown in Fig. 13, and consists of three conceptual
layers. A strategy layer 65 interfaces to the file
system I/O requests 68, a scheduler layer 66 to be
described later, and a physical layer 67 which interfaces
to the normal system disk device drivers 69,
both logical and physical. This pseudo device driver
64 intercepts file system I/O requests 68 destined to
and from the disk device drivers 69 and performs the
functions of mirroring, stale partition processing,
Status Area management, and Mirror Write Consistency,
all of whose operations and functions will now
be described.

Mirroring

Mirroring is used to support replication of data for
recovery from media failures. Normally, users have
specific files or filesystems that are essential and the
loss of which would be disastrous. Supporting mirroring
only on a complete disk basis can waste a considerable
amount of disk space and result in more
overhead than is needed.

A partition is a fixed sized, physically contiguous
collection of bytes on a single disk. Referring to Figure
14, a logical volume 70 is a dynamically expandable
logical disk made up of one or more logical partitions
71. Each logical partition is backed up by one or more
physical partitions, such as 72, 74 and 76. The logical
partition is backed up by one (72) if the partition is not
mirrored, by two (72 and 74) if the partition is singly
mirrored, and by three (72, 74, and 76) if the partition
is doubly mirrored.

Mirroring can be selected in the following ways for
each logical volume: (i) none of the logical partitions
in a logical volume can be mirrored, (ii) all of the logical
partitions in a logical volume can be mirrored, or
(iii) selected logical partitions in a logical volume can
be mirrored.

Stale Partition Processing

In order for mirroring to function properly, a
method is required to detect when all physical partition
copies of the mirrored data are not the same. The
detection of stale physical partitions (PP) and initiation
of stale physical partition processing is done in Fig. 13
at the scheduler layer 66 in the driver 64. This
scheduler layer has two I/O request policies, initial
request and post request. The initial request policy
receives and processes requests from the strategy
layer 65 and is illustrated in Fig. 18. The post request
policy interfaces with the physical layer 67 and is illustrated
in Fig. 19. A description of the functions within
these policies follows.

Initial Request Policy

REGULAR

Returns EIO for a request that avoids the only copy,
or when the target PP is being reduced, or when the
target PV is missing. If the request is a special VGSA
write, REQ_VGSA set in b_options, then a special pbuf
(sa_pbuf) is used which is embedded in the volgrp
structure, instead of allocating a pbuf from the free pool.

SEQUENTIAL

Returns EIO for a request that avoids all active
copies (mirrors). Read requests select the partition to
read from in primary, secondary, tertiary order. Write
requests select the first active copy and initiate the
write. The remaining copies are written in sequential
order, primary, secondary, tertiary, after the preceding
partition has been written. Sequential only initiates
the first physical operation. Any subsequent operations,
due to read errors or multiple writes, are handled
by the post request policy. Sequential does not
write to partitions that are stale or are on PVs with a
status of missing.

PARALLEL

Returns EIO for a request that avoids all active
copies (mirrors). Read requests read from the active
partition that requires the least amount of PV head
movement based on the last queued request to the
PV. Write requests generate writes to all active partitions
simultaneously, i.e. in parallel. PARALLEL does
not write to partitions that are stale or are on PVs with
a status of missing.
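
The two read selection rules can be contrasted in a small sketch.
This Python is not the driver's code: the function names are
hypothetical, and "head movement" is modelled, as an assumption, by
the absolute distance between the target block and the block of the
last request queued to that PV.

```python
import errno


def select_sequential_read(usable):
    """Pick the first usable copy in primary, secondary, tertiary order.

    usable: list of booleans, one per copy, in mirror order.
    """
    for copy, ok in enumerate(usable):
        if ok:
            return copy
    raise OSError(errno.EIO, "request avoids all active copies")


def select_parallel_read(copies):
    """Pick the usable copy needing the least (modelled) head movement.

    copies: list of (usable, target_block, last_queued_block) tuples,
    one per copy, in mirror order.
    """
    best = None
    best_distance = None
    for copy, (ok, target, last_queued) in enumerate(copies):
        if not ok:
            continue
        distance = abs(target - last_queued)
        if best is None or distance < best_distance:
            best, best_distance = copy, distance
    if best is None:
        raise OSError(errno.EIO, "request avoids all active copies")
    return best
```

Ties in the parallel case go to the earlier copy here, which is a
choice made for the sketch rather than something the text specifies.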

AVOID

Builds an avoid mask for the mirrored policies.
This mask informs the scheduling policy which partitions
to avoid or not use. Following is a description of
when mirrors are to be avoided.

GENERAL - Applies to both read & write
requests.

i) Non-existent partitions or holes in a logical
partition.

ii) Explicitly avoided by request. There are
bits (AVOID_C1,2,3) in the b_options field of the
request that explicitly avoid a copy (used for read
requests only).

READ - Applies to read requests only.

i) Partitions located on PVs with a status of
missing.

ii) Partitions that are in the process of being
reduced or removed.

iii) Partitions that have a status of stale.

WRITE - Applies to write requests only.

i) Partitions that are in the process of being
reduced or removed.

ii) Partitions that have a status of stale and that
status is not in transition from active to stale.

iii) If there is a resync operation in progress in the
partition and the write request is behind the current
resync position then allow the write even if the
partition status is stale.

If the request is a resync operation or a mirror write
consistency recovery operation, the sync-mask is
also set. The sync-mask informs the resyncpp post
request policy which partitions are currently stale and
therefore which ones to attempt to write to once good
data is available.
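
The avoid rules listed above can be sketched as a single mask-building
function. Everything named here (the `Copy` record, its fields,
`build_avoid_mask`) is hypothetical and chosen for illustration; the
sketch also assumes that "behind the resync position" means a block
number lower than the current resync position, and it applies the
explicit avoid bits to reads only, as the text notes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Copy:
    """Hypothetical per-mirror view of one physical partition."""
    stale: bool = False
    changing: bool = False                  # in transition from active to stale
    reducing: bool = False
    pv_missing: bool = False
    resync_position: Optional[int] = None   # None when no resync is running


def build_avoid_mask(copies, is_write, block=0, explicit_avoid=0):
    """Return a bitmask in which bit n is set when copy n must be avoided.

    copies: list of Copy objects, with None standing for a hole.
    explicit_avoid: bitmask of copies the request asks to avoid (reads only).
    """
    mask = 0
    for n, pp in enumerate(copies):
        if pp is None:
            avoid = True                                   # GENERAL i)
        elif not is_write:
            avoid = (bool((explicit_avoid >> n) & 1)       # GENERAL ii)
                     or pp.pv_missing                      # READ i)
                     or pp.reducing                        # READ ii)
                     or pp.stale)                          # READ iii)
        else:
            behind_resync = (pp.resync_position is not None
                             and block < pp.resync_position)
            avoid = (pp.reducing                           # WRITE i)
                     or (pp.stale and not pp.changing      # WRITE ii)
                         and not behind_resync))           # WRITE iii)
        if avoid:
            mask |= 1 << n
    return mask
```

With copies `[Copy(), Copy(stale=True), None]` both a read and a
write avoid the second and third copies; once the second copy is
flagged as changing, a write no longer avoids it while a read still
does.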

Post Request Policy

FINISHED

Generally the exit point from the scheduler layer
back to the strategy layer. Responsible for moving
status from the given pbuf to the lbuf. If the pbuf is not
related to a VGSA write, REQ_VGSA set in b_options,
the pbuf is put back on the free list.

MIRREAD

Used by both sequential and parallel policies
when the request is a read. It has the responsibility of
checking the status of the physical operation. If an
error is detected it selects another active mirror. It
selects the first available mirror in primary, secondary,
tertiary order. When a successful read is complete
and there were read errors on other mirrors MIRREAD
will initiate a fixup operation via FIXUP.

SEQWRITE

Used by the sequential policy on write requests.
It has the responsibility of checking the status of each
write and starting the write request to the next mirror.
Writes are done in primary, secondary, tertiary order.
When all active mirrors have been written, any mirrors
that failed are marked stale by the WHEEL (to be
described hereafter).

PARWRITE

Used by the parallel policy on write requests. The
initial parallel policy issued physical requests to all
mirrors in parallel. PARWRITE checks the status of
each of the completed physical requests. PARWRITE
remembers only if a write error occurred or not. PARWRITE
puts the pbufs back on the free list as they
complete and coalesces the status into an outstanding
sibling. Therefore the last physical request to complete
holds the pass/fail status of all the siblings
including itself. If any write errors are detected the
affected mirrors will be marked stale (by the WHEEL)
only after all physical requests for a given logical
request are complete.
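
The coalescing of sibling status can be sketched as follows. The
class and function names are hypothetical. As an assumption made for
illustration, the sketch carries the set of failed mirrors from
sibling to sibling, whereas the text says only that PARWRITE
remembers whether a write error occurred.

```python
class PhysicalRequest:
    """Hypothetical stand-in for one pbuf of a parallel write."""

    def __init__(self, mirror):
        self.mirror = mirror
        self.failed_mirrors = set()   # status coalesced into this request


def parwrite_complete(outstanding, finished, write_failed):
    """Process one completed sibling of a parallel write.

    outstanding: list of siblings still in flight, including `finished`.
    Returns None while siblings remain outstanding, otherwise the set of
    mirrors whose writes failed, which would then go to the WHEEL.
    """
    outstanding.remove(finished)
    if write_failed:
        finished.failed_mirrors.add(finished.mirror)
    if outstanding:
        # Fold this request's status into a sibling that is still in flight.
        outstanding[0].failed_mirrors |= finished.failed_mirrors
        return None
    # Last sibling to complete: it holds the status of the whole group.
    return finished.failed_mirrors
```

Only the call that finds no remaining siblings returns a result,
which matches the rule that mirrors are marked stale only after all
physical requests for the logical request are complete.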

FIXUP

Used to fix a broken mirror, one that had a read
error, after another mirror was read successfully.

RESYNCPP

Used to resynchronize a logical partition, LP. The
initial policy, sequential or parallel, selects an active
mirror, one not stale or on a missing PV, to read from
first. RESYNCPP checks the status of the read. If an
error is detected RESYNCPP will select another mirror
if one is available. Once a successful read has
been done RESYNCPP will write that data to any stale
physical partition in the LP. RESYNCPP does not
attempt to fix broken mirrors, i.e. ones that failed the
initial read. RESYNCPP is also used to do MIRROR
WRITE CONSISTENCY RECOVERY (MWCR) operations.
During MWCR operations RESYNCPP will
mark partitions stale if a write fails in the partition.

SEQNEXT

Used to select the next active mirror considering
the ones already used, stale partitions, and missing
PVs.

Referring now to Fig. 14, each PP 72 that is
defined in the volume group has state information in
the partition structure. Each PP must be in one of two
permanent states. It can be active, available for all I/O,
or stale, not available for all I/O. In addition, there are
two intermediate states, called reducing and changing.
The permanent state of each PP in the volume
group 84 is also maintained in the Status Area (VGSA)
82, as shown in Fig. 15. A copy of the VGSA 82 resides
on each physical volume 80 of the volume group 84.

This allows the state of each partition to be
retained across system crashes and when the VG is
not online. The driver 64 has the responsibility of
maintaining and updating the VGSA. Stale PP processing
is not complete until all VGSAs have been
updated. The VGSAs are updated by a mechanism
hereinafter called the WHEEL, which is described
hereafter, and references will be made that indicate
requests will be going to or returning from the VGSA
WHEEL 90 shown in Fig. 16. The VGSA WHEEL's
request object is the physical request (PR) or pbuf
structure. It accepts pointers to PRs and returns them
via the pb_sched field of the same structure when all
VGSAs have been updated.

Following is a description of each PP state.

ACTIVE

The partition is available for all I/O. Read
requests for the LP can read from this PP. Writes to
the LP must write to this PP.

STALE

The partition cannot be used for normal I/O. The
data in the partition is inconsistent with data from its
peers. It must be resynchronized to be used for normal
I/O. It can be reduced or removed from the LP.

REDUCING

The partition is being reduced or removed from
the LP by the configuration routines. Any reads or
writes that are currently active can be completed
because the configuration routines must drain the LV
after putting the partition in this state. The initial
request policies must avoid this PP if a read request
is received when the PP is in this state. The configuration
routines will also turn on the stale flag under
certain conditions to control write requests that may
be received. These configuration routines are further
described hereafter.

CHANGING

The partition has changed states from active to
stale and the initial request that caused that change
has not been returned from the VGSA WHEEL. A read
request to a LP that has a PP changing must avoid the
PP. A write request to the LP cannot be returned until
the WHEEL returns the initial request that caused the
state change. This is done by actually building the PR
and then handing it off to the VGSA WHEEL. The
WHEEL handles duplicate operations to the same
partition and will return them when the initial request
is returned.

There are some general rules that apply to logical
requests (LR) and PPs when they encounter stale PP
processing. First, once a partition goes stale it cannot
accidentally become active again due to a system
crash or error. There is one exception to this: if the VG
was forced on with the force quorum flag the selected
VGSA may not have contained the latest PP state
information. If a user forces the VG, they take their
chances. Secondly, a LR will not be returned until all
stale PP processing is complete. This means that all
VGSAs have been updated.

It is an illegal state for all copies of a logical
partition (LP) to be marked stale. There must be at least
one active partition. That one active partition can be
on a PV that is missing. All writes to that LP will fail
until the PV is brought back online. Of course the
entire LP can be reduced (removed) out of the LV.

If all copies of a LP have write failures then all but
one copy will be marked stale before the LR is returned
with an error. Since there is no guarantee that all
the writes failed at the same relative offset in the PRs,
the assumption must be made that possible inconsistencies
exist between the copies. To prevent two different
reads of the same logical space from returning
different results (i.e. they used different copies), the
number of active partitions must be reduced to one for
this LP. The only exception to this is when no PR has
been issued before the detection that all copies will
fail, which might occur if the logical volume (LV) is
using the parallel write policy, which is described
hereafter.
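
The rule that at least one copy must remain active can be expressed
as a small decision function. The function name is hypothetical, and
keeping the last copy in mirror order active when every write fails
is an assumption borrowed from the sequential case, where the last
PR of the LR is the one that cannot be marked stale.

```python
def copies_to_mark_stale(write_failed):
    """Decide which copies go stale after the writes of one LR.

    write_failed: list of booleans, one per active copy, in mirror order.
    Returns (indices_to_mark_stale, lr_returns_error).
    """
    failed = [n for n, bad in enumerate(write_failed) if bad]
    if failed and len(failed) == len(write_failed):
        # Every copy failed: leave exactly one copy active so that the
        # LP never reaches the illegal all-stale state, and fail the LR.
        return failed[:-1], True
    return failed, False
```

For three copies that all fail, the first two are marked stale and
the LR is returned with an error; when only some copies fail, those
copies are marked stale and the LR succeeds.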

There are three ways a PP can become stale. The
first is by the system management mechanism that
extends a LP horizontally when valid data already
exists in at least one PP for this LP or the PP is being
reduced (removed). This will be referred to as the config
method.

A partition can become stale when a write to its
respective LP is issued and the PV where the PP is
located has a status of missing. This type of staleness
is detected before the physical request is issued to the
PV. This will be referred to as the missing PV method.

Finally, a PP can become stale when a write to it
is returned with an error. This will be referred to as the
write error method.

A more detailed discussion of action and timing
for each method follows.

CONFIGURATION METHOD

The config method is really outside the normal
flow of read and write requests that flow through the
scheduler layer. It is important that the driver and the
configuration routines stay in sync with each other
when the state of the PP is changing. A set of procedures
is defined later that covers how this is done.

MISSING PV METHOD

Detecting that a PR is targeted for a PV that has
a status of missing must be done before the request
is issued, as shown in Fig. 19. All mirrored write policies
96, 100, 122 and 124 must check the target PV's
status before issuing the PR to the lower levels 106.
If that status is detected the PR will be sent to the
VGSA WHEEL 90. The PR must have the proper post
policy encoded, the B_DONE flag reset in the b_flags
field and a type field that requests this PP be marked
stale. The post request policy, Fig. 19, will decide
what action is next for this LR when the VGSA WHEEL
returns the PR. The one exception to this is in the initial
request parallel policy. If it detects that all active
partitions are on PVs with a status of missing, it can
return the LR with an error, EIO, and not mark any partitions
stale. It can do this because the data is still consistent
across all copies for this request.



WRITE ERROR METHOD

When a request is returned from the physical
layer 108, to the post request policy shown in Fig. 19,
with an error, the policy must decide if the partition is
to be marked stale. There are several factors involved
when deciding to mark a mirror stale. A few are mentioned
below with references to Fig. 19.

If the post policy is sequential 98 and this is the
last PR for this LR and all other previous PRs failed
(and their partitions marked stale), then this partition
cannot be marked stale. If it were marked stale then
all copies of this LP would be marked stale and that
is an illegal state.

Resync operations 102 do not mark mirrors stale,
but if the write portion of a resync operation fails then
the failing partition cannot be put into an active state.

Mirror write consistency recovery operations will
mark a mirror stale if the write to the mirror fails.

In any case, if the partition is to be marked stale
the PR must be set up to be sent to the VGSA
WHEEL. This entails the proper post policy be
encoded, the B_DONE flag reset (in the b_flags field)
and a type field be set that requests this PP be marked
stale. When the PR is returned by the VGSA WHEEL
the receiving post policy will decide what action is next
for this PR and the parent LR.

Any post request policy that receives a PR from
both the physical layer 108 and the VGSA WHEEL 90
must query the B_DONE flag in the b_flags field to
determine the origin of the PR. Since the post request
policy handles PRs from both the physical layer and
the VGSA WHEEL, it makes all the decisions concerning
the scheduling of actions for the request and
when the LR request is complete.

Now that the states of a PP have been defined,
the procedures for handling a request in relationship
to those states must be defined. Also defined are the
procedures the configuration routines and the driver
must follow for changing the states in response to system
management requests.

Driver only procedures

1) State is active

Read requests may read from the partition.

Write requests in the initial request policies must
write to the partition.

Write requests in the post request policies of Fig.
19 that are returned with an error must

i) Turn on the changing flag and stale flag. The
partition has just changed states.

ii) Remember that the PR failed.

iii) Hand the PR off to the VGSA WHEEL 90.

iv) When the PR is returned from the WHEEL 90
the changing flag must be turned off. The partition
has just changed states again.

2) State is stale

Read requests in the initial request policies must
avoid the partition.

Write requests in the initial request policies will
avoid the partition.

Write requests in the post request policies that
are returned with an error must

i) Remember that the PR failed. Since the changing
state flag is not on, that is all the action necessary
at this point. There is no need to send this
request to the WHEEL 90 because the partition is
already marked stale. This condition can happen
because the partition was active when the
request was handed off to the disk device driver,
but by the time it returned another request had
already failed, marked the partition stale via the
VGSA WHEEL and returned from the WHEEL.
Therefore, there is nothing for this request to do
concerning stale PP processing.

3) State changing from active to stale

Read requests in the initial request policies of Fig.
18 must avoid the partition altogether.

Write requests in the initial request policies of Fig.
18 must be issued to the disk drivers as if the partition
was not changing states. The post request policies of
Fig. 19 will handle the request when it is returned.

Write requests in the post request policies of Fig.
19 that are returned with an error must

i) Remember that the PR failed.

ii) Hand the PR off to the VGSA WHEEL.

iii) When the PR is returned from the WHEEL the
changing flag should have been turned off. This
request can now proceed.
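
The three write-error cases above amount to a small state machine.
The sketch below is a hypothetical model in Python, not the driver's
code: `PPState`, `on_write_error` and `on_wheel_return` are invented
names, and the flag values are arbitrary.

```python
from enum import Flag


class PPState(Flag):
    """Hypothetical flag set for one physical partition."""
    ACTIVE = 0
    STALE = 1
    CHANGING = 2
    REDUCING = 4


def on_write_error(state):
    """Return (new_state, send_to_wheel) for a write returned with an error."""
    if state & PPState.CHANGING:
        # 3) Already changing: hand the PR to the WHEEL and wait there.
        return state, True
    if state & PPState.STALE:
        # 2) Already stale and recorded: only remember the failure.
        return state, False
    # 1) Active: turn on the stale and changing flags, go to the WHEEL.
    return state | PPState.STALE | PPState.CHANGING, True


def on_wheel_return(state):
    """The WHEEL returned the PR that caused the change: clear changing."""
    return state & ~PPState.CHANGING
```

In every case the caller would also record that the PR failed; the
sketch leaves that bookkeeping out and models only the flag changes
and the decision to use the WHEEL.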

NOTE: Post request policies that find that a read
request has failed just select another active partition
and retry the read. Read errors do not usually cause
data inconsistencies, but, as usual, there is one
exception. There is a post request policy called fixup
100. This policy attempts to fix broken mirrors, i.e.
ones that have had a read error. It fixes these broken
mirrors by rewriting them once a successful read has
been completed from another mirror. If the rewrite of
a broken mirror fails this partition must be marked
stale since it is now possible for the data to be inconsistent
between the mirrors.

Configuration - Driver procedures

1) Partition created stale

When a LP that already contains valid data is horizontally
extended, the resulting driver structures and
the VGSA must indicate that the partition is stale. This
means the VGSAs must all be updated before the
configuration operation is considered complete. A
more detailed procedure can be found in the VGSA
discussion to follow.

i) Configuration routines set up the permanent
state of each PP being allocated via the VGSA
WHEEL. See the VGSA discussion for a more
detailed breakdown of what must be done in this
step.

ii) Configuration routines set up driver structures
and link them into the existing driver information.
If the partition is active it can be used
immediately. If stale, it must be resynchronized
before it will be used. This step should be done
disabled to INTIODONE to inhibit any new
requests from being scheduled while the configuration
routines are moving driver structures
around.

2) Reducing an active or stale partition

The procedure below will work for reducing either
an active partition or a stale partition. It is very high
level. A more detailed procedure can be found in the
VGSA discussion.

i) The configuration routines set the state flag for
each PP being reduced (removed) and initiate
the update of the VGSAs. This is done via the
configuration/VGSA WHEEL interface.
NOTE: This is not necessary if all PPs being
reduced are already stale.

ii) With the state of each partition now stale and
recorded permanently the LV must be drained.
Draining the LV means waiting for all requests
currently in the LV work queue to complete.
NOTE: This is not necessary if all PPs being
reduced are already stale.

iii) Disabled to INTIODONE, the configuration
routines may now remove the driver structures
associated with the PPs being removed.

3) Stale PP resynchronization

Up to this point the discussion has centered on
marking partitions stale. There is another side to this
issue. How is the data made consistent between
copies so all are available and active again? This
operation is called a resync operation. The
resynchronization of an entire LP is accomplished by
an application process issuing, via the character
device node of the LV, multiple resync requests starting
at the beginning of the LP and proceeding sequentially
to its end. This must be done by issuing readx
system calls with the ext parameter equal to
RESYNC_OP as defined in sys/lvdd.h. Each request
must start on a logical track group (LTG) boundary and
have a length of one LTG. A LTG is 128K bytes long.
Therefore, to resynchronize a 1 MB LP a series of 8
of these requests would have to be made. After the
8th resync operation, if there were no write errors in the
partition by any operation, resync or normal write, the
VGSA is updated to indicate that the newly
synchronized partitions are now fresh and active.

Each resync request is made up of several physical
operations. The first operation is a read and it is
initiated by the initial request policy of Fig. 18. The post
request policy of RESYNCPP in Fig. 19 verifies that
the read was done without errors. If an error is returned
another active mirror is selected to read from. If
there are no other active mirrors the resync request is
returned with an error, and the resynchronization of
the LP is aborted. It must be restarted from the
beginning.

The next operation of the resync request is to
write the data, just read, to any stale partitions using
a sequential write type policy. If an error is returned
the partition status is updated to indicate the partition
has had a sync-error.

If all stale partitions have a status of sync-error
the resynchronization of the LP is aborted. If all the
LTGs of one PP are successfully resynchronized then
that PP will change status from stale to active.
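
The request arithmetic for a resync pass can be sketched as below.
The function name is hypothetical and the sketch only computes the
(offset, length) pair of each request; actually issuing them through
readx with RESYNC_OP, as the text describes, is not shown.

```python
LTG_BYTES = 128 * 1024   # one logical track group, 128K bytes (from the text)


def resync_requests(lp_start, lp_bytes):
    """Return the (offset, length) of each resync request for one LP.

    Each request starts on an LTG boundary and is exactly one LTG long,
    and the requests run sequentially from the start of the LP to its end.
    """
    if lp_start % LTG_BYTES or lp_bytes % LTG_BYTES:
        raise ValueError("LP must start and end on an LTG boundary")
    return [(offset, LTG_BYTES)
            for offset in range(lp_start, lp_start + lp_bytes, LTG_BYTES)]
```

For a 1 MB LP starting at offset 0 this yields the 8 requests
mentioned in the text, the last one starting at byte 917504.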

Following is a list of actions and decisions surrounding
the resynchronization of a LP.

i) The synchronization of a LP is initiated by issuing
a resync request at the first LTG in the partition
and proceeding sequentially through the last
LTG in the partition. The LP must have stale mirrors
or this initial request will be returned with an
error. The sync-error status of each PP will be
cleared and the internal resync position (LTG
number) maintained by the driver will be set to 0.
This internal resync position is referred to as the
sync track. A sync track value that is not 0xFFFF
indicates that this LP is being resynchronized and
what track is currently being done or was last
done. There is a flag in the PP state field which
qualifies the sync track value; it is called the
Resync-In-Progress (RIP) flag. When the RIP flag is
on, the sync track value represents the LTG currently
being operated on. If the RIP flag is reset
the sync track value represents the next LTG to
be operated on. This is how the driver remembers
the position of the resynchronization process and
allows normal read/write operations to proceed
concurrently with the resynchronization of the LP.

ii) A LTG will be resynced if the partition is stale
and the sync-error flag is reset.

iii) Any write error in the LTG being resynced will
cause that partition to have a status of sync-error.
Writes in a LP that occur behind the sync track
write to all PPs even though they may be stale.
The exception to this is if the partition has the
sync-error flag on. Consequently, any write errors
cause the copies to be inconsistent again. Therefore,
these write errors must turn on the sync-error
flag to let the resynchronization process know
that an error has occurred behind it in this
partition.

iv) An individual resync request is considered
successful if at least one of the LTGs currently
being resynced completes with no errors.

v) If all stale partitions in a LP develop a status of
sync-error the resynchronization of the LP is
aborted. It must be restarted from the beginning
of the LP.

Recovery of the VGSA at VG varyon time is
addressed by the discussion of the VGSA and the
VGSA WHEEL.

VOLUME GROUP STATUS AREA (VGSA)

Each physical partition (PP) in the volume
group (VG) has two permanent states, active or stale.
These states are maintained in the Status
Area (VGSA). There is a copy of the VGSA on each PV
in the VG, as shown in Fig. 15. Some PVs may have
more than one copy. The VGSA copies on all PVs,
along with a memory version, are maintained by
software in the driver 64 that runs in the scheduler
layer 66 in Fig. 13. This software accepts requests
from the scheduling policies of Fig. 18 & Fig. 19 or the
configuration routines to mark partitions stale or
active. This software is called the WHEEL because of
the way it controls and updates the active VGSAs in
the VG. Refer to Fig. 16 for the following discussion
on the WHEEL.

The basic object of the WHEEL is to ensure that
all VGSAs in the VG are updated with the new state
information for any given WHEEL request. It would be
easy and relatively fast to issue write requests to all
VGSAs at the same time. But that would also be very
dangerous since with that method it is possible to
have a catastrophic error that would cause the loss of
all VGSAs in the VG. That brings up the first of the
general rules for the WHEEL.

General Rule 1)

Only one VGSA write can be in flight at a time.

Refer to Fig. 16. When a request is received by
the WHEEL the memory version of the VGSA is
updated as per the request. Then VGSA 1 is written.
When it is complete a write to VGSA 2 is issued. This
continues until VGSA 8 has been written. The WHEEL
is now back at VGSA 1 where it started. Now the
request is returned back to the normal flow of the
driver, as shown in Fig. 19, so it can continue to its
next step. The second general rule is:

General Rule 2)

A request cannot be returned until all VGSAs in
the VG have been updated with that request's
operation.

It should be obvious now why this is called the
WHEEL. It should be equally obvious that any request
on the WHEEL may stay there a while. In the above
example the request had to wait for 8 complete disk
operations. If a VG contained 32 PVs a request would
have to wait for 32 disk operations. Now, assume
while the request was waiting for the write to VGSA 1
to complete another request came in and wanted to
update the VGSA. If the second request had to wait
for the first request to get off of the WHEEL it would
have to wait for 16 disk operations before it could continue:
eight disk operations for the first request and 8
for itself. This wait time could become quite large if the
VG contained a large number of PVs. Luckily, the
WHEEL has several conditions to reduce this wait.

The WHEEL manages the requests it receives so
that no one request must wait, stay on the wheel if you
will, longer than the time it takes to write the total number
of VGSAs in the VG plus one. This is accomplished
by allowing requests to get on the WHEEL
between VGSA writes. A request then stays on the
WHEEL until the WHEEL rolls back around to the
position where the request got on the WHEEL. Once
the WHEEL has been started it is said to be rolling.
Once rolling it will continue to write the next VGSA
until all VGSAs in the VG contain the same information
regardless of how many requests get on and
off the WHEEL or how many revolutions it takes. This
is sometimes called free wheeling.

In reference to the above two request scenario,
the following would happen. Request #1 comes in
from the initial or post request policies, as shown in
Figs. 18 and 19, and causes the WHEEL to start rolling
by writing to VGSA 1. Request #2 comes in and
waits for the write to VGSA 1 to complete. When that
write is complete, request #2 updates the memory
version of the VGSA. When that is done VGSA 2 is
written. When that completes VGSA 3 is written. This
continues until VGSA 1 is the next one to be written.
At this point request #1 is returned to the normal flow
of the driver 64 since all VGSAs reflect the status
change requested by request #1. Now the write to
VGSA 1 is started since that VGSA does not match
the image of VGSA 8. When that write is complete
request #2 is returned to the normal flow of the driver.
This is done because the WHEEL has rotated back to
the position where request #2 jumped on. Now,
because VGSA 2, the next to be written, and VGSA
1, the last written, are identical the WHEEL stops. The
next request will start the WHEEL at VGSA 2.
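
The two request scenario can be replayed with a toy model of the
WHEEL. This is a sketch under stated assumptions, not the driver's
implementation: the `Wheel` class and its methods are hypothetical,
each VGSA image is reduced to a set of stale partition names, and a
request is modelled as boarding between two writes rather than
arriving while a write is in flight.

```python
class Wheel:
    """Toy model: one VGSA write at a time, requests ride one revolution."""

    def __init__(self, vgsa_count):
        self.disk = [frozenset()] * vgsa_count   # on-disk VGSA images
        self.memory = frozenset()                # memory version of the VGSA
        self.next = 0                            # index of next VGSA to write
        self.riding = []                         # [request, writes_remaining]
        self.log = []                            # VGSA numbers in write order

    def get_on(self, request, stale_pp):
        """Board between writes: update memory, then wait one revolution."""
        self.memory = self.memory | {stale_pp}
        self.riding.append([request, len(self.disk)])

    def rolling(self):
        """The WHEEL keeps rolling while the next VGSA differs from memory."""
        return self.disk[self.next] != self.memory

    def write_next(self):
        """Write one VGSA and return the requests that get off afterwards."""
        self.disk[self.next] = self.memory
        self.log.append(self.next + 1)
        self.next = (self.next + 1) % len(self.disk)
        for entry in self.riding:
            entry[1] -= 1
        done = [request for request, left in self.riding if left == 0]
        self.riding = [entry for entry in self.riding if entry[1] > 0]
        return done
```

With 8 VGSAs, letting request #1 board, writing VGSA 1, letting
request #2 board and then writing until `rolling()` is false gives
the write order 1 through 8 followed by 1 again. Request #1 gets off
after the write to VGSA 8, request #2 after the second write to
VGSA 1, and the WHEEL stops with VGSA 2 as the next to be written.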

There is one other major WHEEL condition. It is
called piggybacking. It is very likely, given the nature
of disk drives, that several disk requests will fail in the
same partition. This will result in all of them wanting
to change the state of that partition. Depending on the
length of time between these like failures it would be
possible for these like state change requests to get on
the WHEEL at various positions. That is where
piggybacking comes in. Before a request is put on the
WHEEL a check is made to see if a like request is
already on the WHEEL. If one is found the new
request is piggybacked to the one already there.
When it comes time for the first request to get off of
the WHEEL any piggybacked requests get off also.
This allows the like state change requests to get off
sooner and keeps the WHEEL from making any
unnecessary writes.
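
A minimal sketch of the piggyback check follows. It assumes that two requests are "like" when they name the same partition and the same new state; the class WheelRequest and the functions put_on_wheel and get_off are hypothetical names introduced for the illustration.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)            # compare requests by identity
class WheelRequest:
    partition: int              # partition whose state is changing
    new_state: str              # for example "stale"
    riders: list = field(default_factory=list)   # piggybacked requests


def put_on_wheel(active, partition, new_state):
    """Place a request on the wheel; return (request, piggybacked)."""
    req = WheelRequest(partition, new_state)
    for existing in active:
        if (existing.partition, existing.new_state) == (partition, new_state):
            existing.riders.append(req)    # like request: ride along
            return req, True
    active.append(req)                     # first of its kind
    return req, False


def get_off(active, req):
    """Remove a request from the wheel; its riders leave with it."""
    active.remove(req)
    return [req] + req.riders
```

In this sketch a piggybacked request never occupies a wheel position of its own, so it cannot cause an extra revolution.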

This is not contradictory to the second general
rule because it states that all VGSAs must have been
updated with a request's information before it is
returned. Piggybacking meets that requirement because
all the piggybacked requests are doing the same
thing. Therefore, they can all get off the WHEEL at the
same position regardless of where they jumped on.
However, the initial request policies and the post
request policies must be aware of any PR that is
changing states. Otherwise, they may return a
request early believing a partition to be already
marked stale when in fact there is a previous request on
the WHEEL doing just that. The second request must
be piggybacked to the one currently on the WHEEL.
This additional intermediate state may be quite long,
relatively. A PR is considered in a changing state from
the time the decision is made to change states until
the time that request gets off of the WHEEL. During
that time any I/O requests that are targeted for a
partition that is changing states must follow the rules
stated in the stale PP processing discussion.

We have seen how the WHEEL handles individual
PRs that are changing the state of a single
partition. But there is another major aspect to the
partition state methodology. That is the configuration
routines. These routines want to set the state of many
partitions as the LV is extended or reduced while it is
open and in use. To accomplish this there must be a
mechanism available and procedures defined that
allow the configuration routines to:

i) pause the WHEEL if it is rolling, otherwise keep
it from starting

ii) set the state of multiple partitions

iii) restart the WHEEL and wait for all the VGSAs
to be updated

This all must be done in a way that maintains LV
integrity during the life of the operation, even across
system crashes.

Refer now to Fig. 20 for the following WHEEL pro- 
cedures. 

Volume Group Status Area 
START 

Called to change the status of a partition. This can
be caused by two different mechanisms. First, a write
failure in a mirror logical partition, LP. Second, an
existing LP is extended, made wider, and valid data
exists in the original. In this case, the newly created
partitions are stale in relationship to the original.
START always puts any new request on the hold list
SA_HLD_LST. Then, if the wheel is not rolling, it will
start it.

SA CONT

This block has several responsibilities. First it
checks to see if a configuration operation is pending.
Since the VGSA wheel is free wheeling, once it is
started, a configuration operation must wait until it, the
wheel, gets to a stopping point before any changes
can be made to the in-memory version of the VGSA.
There is a stopping point for modifications between
writes to the physical volumes. The wheel will not be
started again until the update to the memory version
of the VGSA by the configuration process is complete.
Then the configuration process will restart the wheel.
The second major function is to remove any requests
on the holding list, SA_HLD_LST, and scan the active
list, SA_ACT_LST, for like operations. If a like operation
is found, then associate this request with the previous
one. This allows the new request to get off of the
wheel at the same point as the request that is already
on the active list. If no like operation is found on the
active list, then update the memory version of the
VGSA as per that request. If a loss of quorum (to be
described hereafter) is detected, then flush all the
requests that are on the wheel.

WHL ADV

Advance the wheel to the next VGSA. 

REQ STOP 

Starting at the head of the active list,
SA_ACT_LST, check each request that has not just
been put on the list. If this is the wheel position where
the request was put on the wheel then remove it from
the active list and return it to its normal path. After the
active list has been scanned for completed requests
a check is made to see if the memory version of the
VGSA has been written to the target VGSA on the PV.
If the memory VGSA sequence number does not
match the PV VGSA sequence number a write to the
VGSA is initiated.

NOTE: Once the wheel is started it will continue to
write VGSAs until the memory VGSA sequence
number matches the PV VGSA that will be written next.
Also, if the VGSA to be written next is on a PV that is
missing the wheel will be advanced to the next VGSA
and the active list is scanned again. When an active
VGSA is finally found the wheel position of this VGSA
is put in any new requests that were put on the active
list by SA CONT. This indicates where they are to get
off when the wheel comes back around to this
position. Therefore, no request gets put on the wheel at an
inactive VGSA. But requests can get off at a position
that has gone inactive while the request was on the
wheel.
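
The rule that the wheel skips any VGSA on a missing PV can be sketched as a simple search for the next active position. The function name next_active_position and the representation of missing positions as a set of indexes are assumptions for this illustration.

```python
def next_active_position(pos, missing, n_vgsas):
    """Advance from pos to the next VGSA whose PV is not missing.

    Returns None when every VGSA is on a missing PV.
    """
    for step in range(1, n_vgsas + 1):
        candidate = (pos + step) % n_vgsas
        if candidate not in missing:
            return candidate
    return None
```

New requests would then record the position returned here as the place where they get off, which is why no request is ever put on the wheel at an inactive VGSA.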

WRITE SA

Builds request buffer and calls regular (Fig. 18) to
write the VGSA to a PV.

SA IODONE

The return point for the request generated by
WRITE SA. If the write failed the PV is declared as
missing and a quorum check is made. If a quorum is
lost due to a write failure SA IODONE only sets a flag.
The actual work of stopping the wheel and flushing the
active list is done in SA CONT.

LOST QUORUM

The volume group (VG) has lost a quorum of
VGSAs. Sets flag to shut down I/O through the VG.
Returns all requests on the wheel with errors.

Various modifications may be made in the details
of the preferred embodiment described above.

Following are the high level procedures for the
various configuration management functions used to
maintain a VG when they interact with the WHEEL.

EXTENDING A LV

When extending any LV, even when there are no
mirrors in any LP, the permanent state must be
initialized in the VGSA. There is a general assumption
when extending a LV that any partition being allocated
is not currently in use and that the VGDAs have not
been updated to indicate this partition is now
allocated. It is further assumed that the write of the
VGDAs is near the end of the overall operation so that
the LV maintains integrity if disaster recovery is
needed. There are some conditions that can be
implemented for this procedure and they will be
mentioned.

i) Get control of the WHEEL. That means if it is
rolling, stop it. If it is not rolling, inhibit it from starting.

ii) Modify the memory version of the VGSA.

iii) Restart or start the WHEEL. Wait for it to
complete one revolution.

NOTE: If the WHEEL was not rolling and there were
no state changes in the memory version of the VGSA
then there is no need to restart the WHEEL and wait
for it to complete a revolution.

NOTE: If the WHEEL was rolling and there were no
state changes in the memory version of the VGSA
then restart the WHEEL but there is no need to wait
for it to complete a revolution.

iv) Disable to INTIODONE. Link the new partition
structures into the driver hierarchy. Re-enable
interrupt level. Read/write operations can now
proceed on the PP if it is active.

NOTE: It is assumed the new partition structures
contain the same permanent state as was just initialized
in the VGSA.
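
The two notes attached to step iii) reduce to a small decision table. The sketch below is one way to express it; the function name after_memory_update and its boolean arguments are hypothetical.

```python
def after_memory_update(was_rolling, state_changed):
    """Return (start_wheel, wait_one_revolution) for step iii).

    was_rolling:   the WHEEL was rolling when control was taken.
    state_changed: the memory version of the VGSA was modified.
    """
    start_wheel = was_rolling or state_changed
    wait_one_revolution = state_changed
    return start_wheel, wait_one_revolution
```

The wheel is restarted whenever it was stopped in mid-roll, but the configuration routine only waits for a revolution when it actually changed partition state.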

REDUCING A LV 

Reducing an active LV must be done with care. In
addition to the integrity issues that must be addressed
there is the likelihood that there is I/O currently active
in the PPs that are being removed.

i) Get control of the WHEEL. That means if it is
rolling stop it. If it is not rolling inhibit it from starting.

ii) Disable to INTIODONE. For all LPs being
reduced a check must be made to ensure that at
least one active PP is left in the LP before the
reduction of any LP can proceed. The only
exception to this is if all PPs are being reduced from the
LP, thereby eliminating it. If any LP will be left with
no active PP, the entire reduce LV operation
should fail. For each PP being reduced, turn on
the reducing flag in its respective partition
structure. Also, and this is a big also since it has to do
with integrity in the face of disaster recovery, IF
the PP is a member of a LP with multiple copies
AND not all of the PPs of this LP are being
removed AND the PP being removed is not stale,
THEN the changing and stale flags must be
turned on also. IF the stale flag is turned on THEN
the memory version of the VGSA must be
updated also. THEN, re-enable the interrupt
level.

This is somewhat complicated but must be done
this way to ensure that a PP does not come back
active after system crash if the VGDAs don't get
updated before the crash and a write may have
occurred in the LP before the crash. If all PPs of
a LP are being removed then the reduce flag will
keep any new requests out of the LP. Then if the
system crashes the data will still be consistent
between all copies upon recovery.

iii) IF the memory version of the VGSA was
modified THEN start/restart the WHEEL AND wait for
it to complete one revolution. IF the memory
version of the VGSA was not modified THEN release
the inhibit on the WHEEL and restart it if it was
rolling when we started.

iv) Drain the LV. This means wait for all requests
currently in the LV work queue to complete.

v) Disable to INTIODONE. Now remove the
partition structures from the driver hierarchy for the
PPs being removed. Re-enable interrupt level.

vi) The VGDAs can be written.

NOTE: If the VGDAs cannot be written and the reduce
operation fails, the PPs will remain in their current
state of reducing and/or removed from the driver
hierarchy and, therefore, will not be available for I/O.
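
The checks in step ii) of the reduce procedure can be sketched as two small functions. The names reduction_allowed and reduce_flags, and the representation of an LP as a list of per-PP booleans, are assumptions for the illustration.

```python
def reduction_allowed(pp_active, removing):
    """Check one LP before any reduction proceeds.

    pp_active: list of booleans, True where that PP of the LP is active.
    removing:  set of PP indexes being removed from the LP.
    """
    if len(removing) == len(pp_active):
        return True     # every PP is removed: the LP itself is eliminated
    return any(active for i, active in enumerate(pp_active)
               if i not in removing)


def reduce_flags(copies_in_lp, copies_being_removed, pp_is_stale):
    """Flags to turn on in the partition structure of a PP being reduced."""
    flags = {"reducing"}
    if (copies_in_lp > 1
            and copies_being_removed < copies_in_lp
            and not pp_is_stale):
        # The memory VGSA must be updated too whenever stale is set here.
        flags |= {"changing", "stale"}
    return flags
```

If reduction_allowed returns False for any LP, the whole reduce LV operation fails before any flag is set.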



ADDING A PV TO AN EXISTING VG 

When a PV is added to the VG a VGSA on the PV
must be added to the WHEEL. Since a PV that is
being added cannot have any active PPs the
activation of the VGSA becomes easy. The only real
concern is disaster recovery and even that is simplified.

i) The configuration routines must initialize the
disk VGSA that will be activated. The
configuration routines have two options: they can lay
down a VGSA with a content of binary zeros or
they can get the current image of the memory
version of the VGSA via the IOCTL. The only critical
issue is that the timestamps must be zero to
ensure that this new VGSA will not be used by
varyonvg if the system crashes before adding the
PV is complete.

ii) Get control of the WHEEL. That means if it is
rolling stop it. If it is not rolling inhibit it from
starting.

iii) Disable to INTIODONE. Insert physical
volume structure into volume group structure.
IF the WHEEL was rolling THEN make it rotate at
least back to the position just added. This may
cause some extra writes to PVs that already have
current VGSAs but this will be so infrequent it
should not cause any noticeable delays.
IF the WHEEL was not rolling THEN re-position
the WHEEL controls to the position just before the
newly added position. This is so we won't spin the
WHEEL one whole revolution. The controls
should be set up to make the WHEEL believe the
new position is the last position to be written on
this revolution. This way only the new VGSA is
written and all the others currently on the WHEEL
are not rewritten with the same data they already
have. Since the memory version of the VGSA has
not changed due to the addition it is only
important that the current version be written to the new
disk. It is not important to rewrite the same
information on all the other disk VGSAs.

iv) Re-enable to interrupt level.
Start/re-start the WHEEL.

NOTE: When the WHEEL stops or the requests
from the configuration routines get off the
WHEEL the VGSA is now active and will be
updated if a PP changes state. It is assumed the
VGDAs will be written sometime after the VGSA
is activated. Even if the writing of the VGDA on the
new PV fails the VGSA will remain active unless
there is a defined mechanism to come back down
into the kernel part of LVM and remove it.

v) Increment the quorum count in the volume
group structure.


DELETING A PV FROM AN EXISTING VG 

Deleting a VGSA from the WHEEL is probably the
simplest operation of them all. This is due to the fact
the PV has no active PPs.

i) The VGDAs should be updated to indicate the
PV is no longer in the VG.

ii) Get control of the WHEEL. That means if it is
rolling stop it. If it is not rolling inhibit it from
starting.

iii) Disable to INTIODONE. Check the position of
the WHEEL. If it is resting on the position to be
removed then advance the controls and remove
the physical volume structure from the volume
group structure. If the WHEEL is rolling and the
next position to be written is the position to be
removed adjust the WHEEL controls to skip it
then remove the physical volume structure from
the volume group structure. If the WHEEL was not
rolling or the position was not in any of the
described situations then just remove the physical
volume structure from the volume group
structure.

iv) If the WHEEL was rolling then restart it. If it was
not rolling then remove the inhibit.

There is no need to wait for one revolution of the
WHEEL since no information in the VGSA has
changed. This same procedure should be followed
when deleting a PV with a status of missing.

REACTIVATING A PV 

Reactivating a PV really means a defined PV is 
changing states from missing to active. This will hap- 
pen when a PV is returned or by a re-varyonvg oper- 
ation. The same procedure that is used for adding a 
PV can be used here. It is mentioned here just to
recognize that the condition exists and that it needs no
special processing, outside the defined add PV pro- 
cedure, to reactivate the VGSA. 

VARYING ON A VG

Varying on a VG is really a configuration and
recovery operation as far as the WHEEL is
concerned. Both of these are discussed later. But it is
important to note at this point that the WHEEL should not
become active until a regular LV has been opened in
the VG. This means the WHEEL does not become
active until after the varyonvg operation is complete.

VARYING OFF A VG

There is only one way to varyoff a VG but there
are two modes, normal and forced. The only real
difference between them should be that the forced mode
sets the VGFORCED flag. This flag tells the driver this
VG is being forcefully shut down. A force off will stop
any new I/O from being started. In addition the
WHEEL will stop, if it is rolling, at the completion of the
next VGSA write and return all requests on the
WHEEL with errors. If it is not rolling it will be inhibited
from starting. The same procedure should be followed
for a normal varyoff but it should not encounter any
problems. This is because a normal varyoff enforces
a NO OPEN LV strategy in the VG before it continues.
So, if there are no open LVs in the VG there can be
no I/O in the VG. If there is no I/O in the VG the
WHEEL cannot be rolling. Only one procedure has
been designed to handle both instances.

i) If this is a normal varyoff then enforce NO OPEN
LVs. If this is a force off then set the VGFORCED
flag.

ii) Quiesce the VG. This will wait until all currently
active requests have been returned. This really
only applies to the force off mode since it may
have I/O currently active in the VG.

NOTE: During this time any write requests that have
failures in any mirrored LP partition will have to be
returned with an error, even if one partition worked
correctly. This is because the VGSA cannot be
updated to indicate a PP is now stale. Because the VG
is being forced off the mirror write consistency cache
(described hereafter) has been frozen just like the
VGSA. Therefore, the disk versions of the mirror write
consistency cache remember that this write was
active. Now, when the VG is varied back on, the mirror
write consistency recovery operation will attempt to
resynchronize any LTG that had a write outstanding
when the VG was forced off. Since a mirror write
consistency recovery operation just chooses a mirror to
make the master, it may pick the one that failed at the
time of the forced varyoff. If this is so, and it is
readable, the data in that target area of the write will revert
back to the state it was before the write. Therefore, an
error is returned for a logical request that gets an error
in any of its respective physical operations when the
VG is being forced off and the VGSA cannot be
updated to indicate a PP is now stale. See the
discussion on mirror write consistency for more details
concerning correctness versus consistency.

iii) The driver hierarchy for this VG can now be
removed and the system resources returned to
the system.

There are just a few more areas yet to cover
concerning the VGSA. They are initial configuration,
VGSA recovery, and, finally, a quorum of VGSAs.
Initial configuration will be covered first.

The driver assumes the configuration routines will
allocate memory for the memory copy of the VGSA
and put a pointer to it in the volume group structure.
The configuration routines will select a valid VGSA
and load an image of the selected VGSA into that
memory VGSA before any mirrored LVs are opened.
In addition, there are several other fields in the volume
group structure that will need to be initialized since
there is a reserved buf structure and pbuf structure
embedded in the volume group structure. These
structures are reserved for VGSA I/O operations only.
This guarantees that there is a buf structure used by
logical requests and a pbuf structure used by physical
requests for any VGSA operation and therefore
eliminates deadlock conditions possible if no pbuf
structures were available from the general pool. The
configuration routines also control what PVs have
active VGSAs and where they are located on the PV.
These fields are in the physical volume structure and
must be set up also.

The next topic concerning the VGSA is its
recovery and/or validity at VG varyon time. It is the
configuration routines' responsibility to select a VGSA from
the ones available on the PVs in the VG. This
selection process is exactly like the selection process for
selecting a valid VGDA. It uses the timestamps of Fig.
7 that are present at the beginning and end of the
VGSA. Each time the memory version of the VGSA is
changed, those timestamps are updated to reflect the
system time. The configuration routines must select a
VGSA where the beginning and ending timestamps
match and they are later in time than any other VGSA
available. It goes without saying that the entire VGSA
must be read without errors. Once a VGSA is selected
the configuration routines must use the state flags to
initialize the driver partition structures. If the
configuration routines find a VGSA that is out of date, relative
to the selected one, or has read errors the
configuration routines will rewrite (recover) the VGSA before
the VG is allowed to have normal I/O activity. If the
VGSA cannot be rewritten without errors the PV must
not be used and is declared missing.
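
The selection rule can be sketched as follows. This is a minimal illustration, assuming each candidate VGSA is summarized as a tuple of PV name, beginning timestamp, ending timestamp and a read-status flag. The function name select_vgsa is hypothetical, and treating a copy with mismatched timestamps as one that needs rewriting is an assumption of the sketch.

```python
def select_vgsa(copies):
    """Pick the VGSA to load at varyon time.

    copies: list of (pv_name, begin_ts, end_ts, read_ok) tuples.
    Returns (selected_pv, pvs_to_rewrite); (None, []) if no copy is valid.
    """
    valid = [c for c in copies if c[3] and c[1] == c[2]]
    if not valid:
        return None, []
    best = max(valid, key=lambda c: c[1])      # latest matching timestamps
    rewrite = [c[0] for c in copies
               if not c[3] or c[1] != c[2] or c[1] < best[1]]
    return best[0], rewrite
```

Any PV named in the second result whose rewrite then fails would be declared missing.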

The last VGSA issue to address is the quorum of
VGSAs. Like the VGDAs there must be a quorum of
VGSAs for the volume group to stay online. If a VGSA
write fails, the PV is declared missing, and all active
VGSAs on that PV are therefore missing also. At this
time a count is made of all currently active VGSAs and
if that count is below the quorum count set up by the
configuration routines, the VG is forcefully taken
offline. If the VG is forced offline all requests currently
on the WHEEL are returned with an error (EIO) if they
did not have an error already. The WHEEL is stopped
and will not accept any new requests. In order to
reactivate the VG it must be varied off and then varied back
online.
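
The quorum check after a failed VGSA write can be sketched as below. The function names, the dictionary of active VGSAs per PV, and the use of Python's errno.EIO as the I/O error value are assumptions for the illustration.

```python
import errno


def on_vgsa_write_failure(vgsas_per_pv, failed_pv, quorum_count):
    """Declare failed_pv missing; return True if the VG stays online.

    vgsas_per_pv: dict mapping PV name to its number of active VGSAs.
    """
    vgsas_per_pv.pop(failed_pv, None)      # all VGSAs on that PV are lost
    return sum(vgsas_per_pv.values()) >= quorum_count


def flush_wheel(requests):
    """Return every wheel request with an error.

    requests: dict mapping request name to an error code or None.
    Requests that already carry an error keep it.
    """
    return {name: (err if err is not None else errno.EIO)
            for name, err in requests.items()}
```

When on_vgsa_write_failure returns False, the VG is taken offline and flush_wheel models returning the requests still on the WHEEL.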

For the enabling code which implements the 
WHEEL see Fig. 21. 

MIRROR WRITE CONSISTENCY 

By far the largest problem with any system that
has multiple copies of the same data in different
places is ensuring that those copies are mirror images
of each other. In the preferred embodiment, with LVM
there can be up to three copies of the same data
stretched across one, two or even three physical
volumes. So when any particular logical write is
started it is almost guaranteed that from that point on the
respective underlying copies will be inconsistent with
each other if the system crashes before all copies
have been written. Unfortunately, there is just no way
to circumvent this problem given the nature of
asynchronous disk operations. Fortunately, all is not
lost since LVM does not return the logical request until
all the underlying physical operations are complete.
This includes any bad block relocation or stale PP
processing. Therefore, the user cannot assume any
particular write was successful until that write request is
returned to him without any error flags on. Then, and
only then, can that user assume a read will read the
data that was just written. What that means is, LVM
will concentrate on data consistency between mirrors
and not data correctness. Which in turn means, upon
recovery after a system crash any data a logical write
was writing when the system went down may or may
not be reflected in the physical copies of the LP. LVM
does guarantee that after a system crash the data
between all active PPs of a LP will be consistent. It may
be the old data or it may be the new data, but all copies
will contain the same data. This is referred to as Mirror
Write Consistency or MWC.

There is one restriction on the guarantee of
consistency. The volume group cannot have been
brought online without a quorum. The user has the
ability to force a VG online even if a quorum of VGDAs
and VGSAs is not available. If this forced quorum is
used then the user accepts the fact that there may be
data inconsistencies between copies of a LP.

Since the PPs may not be stale the normal resync
could not be used. Alternatively, a simple function to
read from the LP followed directly by a write to the
same logical address would be sufficient to make all
copies consistent. It could run in the background or
foreground, but in either case it would be time
consuming.

Mirror write consistency is accomplished by
remembering that a write has started and where it is
writing to. It is very critical to remember that the write
was started and where it was writing but less critical
when it completes. This information is remembered in
the mirror write consistency cache or MWCC. So, if
the system crashes, recovery of the PPs within the
LPs being written becomes a function of interpreting
the entries in the MWCC and issuing a mirror write
consistency recovery (MWCR) I/O operation through
the LV character device node to the affected area of
the LP. These MWCR operations must be done
before the LV is available for general I/O. The details of
MWCC will now be described.

There is one MWCC per VG and it is made up of
two parts. The first part, sometimes referred to as the
disk part and sometimes just part 1, is the part that is
written to the physical volumes. Therefore it is the part
that is used to control the MWCR operations during
recovery. Details of part 1 are discussed later. The
second half of the MWCC is the memory part or part 2.
Part 2 of the MWCC is memory resident only. It comes
into being when the VG is brought online. There are
many aspects to controlling a cache such as hashing,
ordering, freeing entries, etc., that have nothing to do
with the recovery of the data in the mirrors and
therefore do not need to be written to disk or permanent
storage. That is why there are two parts to the MWCC.
Part 1 or the disk part is written to the disk while part
2 is not. Each PV holds one copy of the MWCC. A
more detailed breakdown of each part of the MWCC
follows.

PART 1 - DISK PART

Part 1 of the MWCC is 512 bytes long, a disk
block. PV disk block 2 is reserved for the copy. Part
1 has 3 basic parts.

i) A beginning timestamp

ii) Cache entries

iii) An ending timestamp

The timestamps are used during recovery to
validate the MWCC and select the latest copy from all the
available PVs in the VG. The timestamps are 8 bytes
long each. There are 62 cache entries between the
timestamps though they might not all be actively being
used by the VG. The size of the active cache is
variable between 1 and 62 entries. The size of the active
cache is directly proportional to the length of time it will
take to recover the VG after a system crash. This
recovery time will be discussed later. The system is
currently implemented to use 32 cache entries.
Alternate embodiments could provide a command line
option so it is tuneable.

Each part 1 cache entry has 2 fields.

i) Logical Track Group (LTG) number
The cache line size is a LTG or 128K bytes. It is
aligned to LTG boundaries. For example, if the
number of active cache entries in the MWCC
were 32, there could be no more than 32 different
LTGs being written to at any point in time in the
VG.

ii) LV mirror number
The mirror number of the LV that the LTG belongs
in.
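
The stated sizes imply that each part 1 cache entry occupies 8 bytes: a 512 byte block less two 8 byte timestamps leaves 496 bytes for 62 entries. The Python sketch below checks that arithmetic and packs a block. The split of an entry into a 4 byte LTG number, a 2 byte second field and 2 bytes of padding is an assumption, as are the big-endian byte order and the names pack_part1 and ltg_number.

```python
import struct

BLOCK_SIZE = 512
TIMESTAMP_SIZE = 8
MAX_ENTRIES = 62
LTG_SIZE = 128 * 1024

# Space left per entry once both timestamps are accounted for.
ENTRY_SIZE = (BLOCK_SIZE - 2 * TIMESTAMP_SIZE) // MAX_ENTRIES

# Assumed entry layout (not given in the text): 4-byte LTG number,
# 2-byte LV field, 2 pad bytes.
ENTRY = struct.Struct(">IHxx")


def ltg_number(byte_offset):
    """LTG containing a byte offset, using 128K aligned cache lines."""
    return byte_offset // LTG_SIZE


def pack_part1(timestamp, entries):
    """Build the 512-byte disk part from (ltg, lv_field) pairs."""
    if len(entries) > MAX_ENTRIES:
        raise ValueError("at most 62 cache entries fit in part 1")
    body = b"".join(ENTRY.pack(ltg, lv) for ltg, lv in entries)
    body = body.ljust(MAX_ENTRIES * ENTRY_SIZE, b"\0")
    stamp = struct.pack(">Q", timestamp)
    return stamp + body + stamp     # same stamp at beginning and end
```

Writing the same timestamp at both ends is what lets recovery detect a partially written block, since a torn write leaves the two values different.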

PART 2 - MEMORY PART

Each part 2 entry of the MWCC is made of several
fields. Since part 2 is memory resident its size is not
important here. It is important to know that there is a
direct one to one correspondence with the cache
entries of part 1. Therefore if there are 32 cache
entries being used in part 1, part 2 has 32 entries also.

i) Hash queue pointer
Pointer to the next cache entry on this hash
queue. Currently 8 hash queue anchors exist in
the volume group structure.

ii) State flags
NO CHANGE - Cache entry has NOT changed
since last cache write operation.
CHANGED - Cache entry has changed since last
cache write operation.
CLEAN - Entry has not been used since the last
clean up operation.

iii) Pointer to the corresponding part 1 entry
Pointer to the part 1 entry that corresponds to this
part 2 entry.

iv) I/O count
A count of the number of active I/O requests in the
LTG that this cache entry represents.

v) Pointer to next part 2 entry
Pointer to the next part 2 entry on the chain.

vi) Pointer to previous part 2 entry
Pointer to the previous part 2 entry on the chain.

It is important to know that there are two parts to
each cache entry, but from this point a reference to
a cache entry means the entity formed by an entry
from part 1 and the corresponding entry from part 2.

The concept of the MWCC is deceptively simple,
but the implementation and recovery are not. Part of
this complexity is caused by the fact that a VG can be
brought online without all of its PVs. In fact, after a
system crash the VG can be brought online without
PVs that were there when the system went down.

There are two major areas of discussion
concerning the MWCC. There is the function of maintaining,
updating, and writing the cache as the driver receives
requests from the various system components. This
is the front side of the operation. It is done so it is
known what LTGs may be inconsistent at any one
point in time. Then there is the back side of the
operation. That is when there has been a system crash or
non-orderly shutdown and the MWC caches that
reside on the PVs must be used to make things
consistent again. So for now the focus of this discussion
will be on the front side of the operation, which also
happens to be the first step.

The driver will allocate memory and initialize it for
both parts of the cache when the first LV is opened in
the VG. The driver assumes the MWCCs that reside on
the disks have already been used by the recovery
process to make LTGs consistent and that those disk
blocks can be written over without loss of data. The
MWCR (mirror write consistency recovery) operation is
really a read followed by writes. Since the MWCC is
watching for writes the MWCR operations done at
varyon time slip by without modifying the disk copy of
the MWCC.

As the MWCC is an entity that must be managed as
requests are received there is a Mirror Write
Consistency Manager (MWCM). The MWCM sits logically at
the top of the scheduler layer between the scheduler
and the strategy layer. It does not have a whole layer
by itself since its only concern is with mirrored
partition requests but it is easier to understand if you view
it there.

As the initial request policies receive requests
they will make some initial checks to see if the request
should be handed off to the MWCM. Following is a list
of conditions that would cause a request not to be
handed off to the MWCM. This does not mean every
request that does get handed off to the MWCM is
cached. The term cached is used rather loosely here.
A request is not cached in the classical sense but in
the sense that it must wait for information concerning its
operation to be written out to permanent storage
before the request is allowed to continue. So the MWCM
may return the request to the policy to indicate this
request may proceed.

i) The request is a read.

ii) The LV options have the NO mirror write
consistency flag turned on.

iii) The request specifically requests NO mirror
write consistency via the ext parameter.

iv) There is only one active partition in this LP and
no resync is in progress in the partition. That
could mean there is only one copy or all the others
are stale.
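
The four conditions above amount to a single predicate. A minimal sketch follows; the function name goes_to_mwcm and its parameter names are hypothetical.

```python
def goes_to_mwcm(is_read, lv_no_mwc, req_no_mwc,
                 active_partitions, resync_in_progress):
    """Return True if the request should be handed off to the MWCM."""
    if is_read:
        return False                       # i) reads are never handed off
    if lv_no_mwc or req_no_mwc:
        return False                       # ii) LV option, iii) ext parameter
    if active_partitions == 1 and not resync_in_progress:
        return False                       # iv) nothing to keep consistent
    return True
```

A True result only means the MWCM sees the request; whether the request then has to wait for a cache write is decided separately.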

As mentioned earlier, each PV has a block
reserved for the MWCC. But also mentioned was the fact
that they may all be different. The memory image of
the MWCC is a global view of the LTGs in the VG that
have writes currently active. But the disk copies of the
MWCC are really only concerned with the information
in the cache that concerns writes to LPs that have PPs
on that PV. For any given logical write request the
MWCC will be written to the PVs where the PPs are
located for that logical write request.

As an example, if a LP has 3 copies, each on a
different PV, a copy of the MWCC will be written to
each PV before the actual data write is started. All the
writes to the disk copies of the MWCC will be done in
parallel. Even if one of the PPs is stale the MWCC on
that disk must be written before the write is allowed to
proceed on the active partitions. The MWCC must be
written to a PV with a stale mirror in case the PVs that
contain the active mirrors are found to be missing
during a varyon. Of course if a PV is missing the MWCC
cannot be written to it. This is more of a recovery
issue and will be discussed later.

Once the MWCM receives a request there are
several other tests that have to be made before the final
disposition of the request. The MWCM will do one of
the following for each request:

NOTE: Remember these decisions are based on the
cache line size of a LTG, or 128K, and not the size of
the physical partition. Also, the term cache here deals
exclusively with the memory version and is therefore
global to any LV in the VG.

IF (the target LTG is not in the cache) OR (the
target LTG is in the cache AND it is changing) THEN

i) Modify the cache - either add it to the cache or
bump the I/O count

ii) Move the cache entry to the head of the used
list

iii) Initiate writing the cache to the PVs if needed

iv) Put the request on the queues waiting for
cache writes to complete

v) When all cache writes complete for this request
then return the request to the scheduling policy

IF (the target LTG is in the cache AND it is not
changing) THEN

i) Bump the I/O count

ii) Move the cache entry to the head of the used
list

iii) Return the request to the scheduling policy
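
The two tests above can be sketched as one dispatch function. This is a simplified illustration under stated assumptions: the names mwc_entry, mwc_action and mwcm_dispose are hypothetical, a newly added entry is assumed to start in the changing state, and the list move, the actual cache writes and the hold-queue exceptions are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical in-memory cache entry for one LTG. */
struct mwc_entry {
    unsigned io_count;  /* writes currently active in this LTG        */
    bool     changing;  /* cache writes for this entry still pending  */
};

enum mwc_action {
    MWC_WAIT_FOR_CACHE_WRITE,  /* first test: queue behind the cache writes */
    MWC_PROCEED                /* second test: return to scheduling policy  */
};

/*
 * hit       - entry found for the target LTG, or NULL if not in the cache
 * free_slot - entry to use when the LTG must be added (used only if hit == NULL)
 * used      - receives the entry that should move to the head of the used list
 */
static enum mwc_action mwcm_dispose(struct mwc_entry *hit,
                                    struct mwc_entry *free_slot,
                                    struct mwc_entry **used)
{
    assert(used != NULL);
    if (hit == NULL) {
        assert(free_slot != NULL);
        free_slot->io_count = 1;      /* add the LTG to the cache            */
        free_slot->changing = true;   /* assumption: new entries are changing */
        *used = free_slot;
        return MWC_WAIT_FOR_CACHE_WRITE;
    }
    hit->io_count++;                  /* bump the I/O count                  */
    *used = hit;
    return hit->changing ? MWC_WAIT_FOR_CACHE_WRITE : MWC_PROCEED;
}
```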
There are some exceptions to the above logic,
however. Since the cache is a finite entity it is possible
to fill it up. When that happens the request must go to
a holding queue until a cache entry is available. Due
to the asynchronous nature of the driver and the
lengthiness of disk I/O operations, which include MWCC
writes, a special feature was added to the disk drivers
to help, but not eliminate, the problem. This feature
allows the driver to tell the disk drivers to not HIDE the
page. This means the driver can reference the MWCC
even if the hardware is currently getting data from
memory. Because of this the driver must take care to
maintain hardware memory cache coherency while
the MWCC is in flight to any PV.

Therefore, in the first test, if either condition is true
and the MWCC is in flight, i.e. being written, the request
will have to go to a holding queue until the MWCC is
no longer in flight. When the last MWCC write
completes, i.e. the MWCC is no longer in flight, the
requests on this holding queue can proceed through
the MWCM. Remember that the hardware will transfer
the MWCC data to the adapter hardware buffer
long before the actual write takes place. If the
information in the cache is changed after this hardware
transfer and before receiving an acknowledgment that
the data has been written, then a window exists where
what is acknowledged to be on the disk is different
from what is really there. In this case, if the request
continues, the disk version of the MWCC may or may
not know that this write is active. This uncertainty
cannot be allowed.

In the first test, if the second condition is true and
the MWCC is in flight, then some might wonder why
not just bump the I/O count and put the request on a
queue waiting for cache writes. This condition comes
about because an earlier request has caused the
entry to be put in the cache and it has started the
cache writes, but not all of them have completed, as
indicated by the changing state still being active. The
problem is that when the first request started the
cache writes, an association was made between it
and all the PVs that needed to have their caches
updated. At the point in time the second request
enters the MWCM there is no way to know how many,
if any, of these cache writes are complete. Therefore,
it is not known how many associations to make for this
second request so that it can proceed when those
cache writes are complete. So, this request is put on
the cache hold queue also. This does not cause the
request to lose much time because when all the cache
writes are complete this hold queue is looked at again.
When this is done, this request will get a cache hit and
proceed immediately to be scheduled.

In the above 2 test conditions, a statement
indicated the cache entry would be moved to the head of
the list. As with all cache systems, there is an
algorithm for handling the things in the cache. The
MWCC uses a Most Recently Used/Least Recently
Used (MRU/LRU) algorithm. When the MWCC is
initialized the cache entries are linked together via
pointers in part 2. These pointers refer to the next and
previous cache entry. This is a doubly linked circular
list with an anchor in the volume group structure. The
anchor points to the first cache entry, i.e. the most
recently used or modified. That entry's next pointer
points to the entry modified before the first one. This
goes on until you get to the last entry in the cache, the
least recently used or modified, and its next pointer
points back to the first entry, i.e. the same one the
anchor points to. Now the previous pointer does the
same thing but in reverse. So, the first entry's
previous pointer points to the last entry in the list, i.e. the
least recently used entry.
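
The list discipline described above can be sketched as follows. The structure and function names (mwc_entry, mwc_init, mwc_move_to_head) are hypothetical; only the shape of the list, a circular doubly linked chain with a single anchor pointing at the most recently used entry, is taken from the text.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical "part 2" link fields of a cache entry. */
struct mwc_entry {
    struct mwc_entry *next;  /* toward less recently used entries */
    struct mwc_entry *prev;  /* toward more recently used entries */
};

/* Link n entries into a circular list; entry 0 becomes the MRU entry. */
static struct mwc_entry *mwc_init(struct mwc_entry *e, size_t n)
{
    assert(e != NULL && n > 0);
    for (size_t i = 0; i < n; i++) {
        e[i].next = &e[(i + 1) % n];
        e[i].prev = &e[(i + n - 1) % n];
    }
    return &e[0];  /* value for the anchor */
}

/* Move entry e to the head (MRU position) of the list. */
static void mwc_move_to_head(struct mwc_entry **anchor, struct mwc_entry *e)
{
    assert(anchor != NULL && *anchor != NULL && e != NULL);
    struct mwc_entry *head = *anchor;
    if (e == head)
        return;
    /* Unlink e from its current position. */
    e->prev->next = e->next;
    e->next->prev = e->prev;
    /* Relink e between the current tail (LRU) and the current head. */
    e->prev = head->prev;
    e->next = head;
    head->prev->next = e;
    head->prev = e;
    *anchor = e;
}
```

With this layout the LRU entry is always reachable in one step as the anchor's previous pointer, which is why no separate tail pointer is needed.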

By using this type of mechanism several things
are gained. First, there is no free list. When a cache
entry is needed the last one (LRU) on the list is taken.
If its I/O count is non-zero, then scan the cache
entries, via the LRU chain, to find an entry with a zero
I/O count. If none are found the cache is full. This
eliminates the need for counters to maintain the
number of entries currently in use.

Note, however, that when the I/O count is non-zero
in the LRU entry, the cache cannot be assumed to
be full. Although the LRU entry is known to have been
in the cache the longest, that is all that is known. If a
system has multiple disk adapters or the disk drivers
do head optimizations on their queues, the requests
may come back in any order. Therefore, a request that
may have been started after the LRU entry could
finish before the LRU requests, thereby making a cache
entry available in the middle of the LRU chain. It is
therefore desirable to have a variable that would
count the number of times a write request had to hold
due to the cache being full.
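
A minimal sketch of the entry search described in the two paragraphs above, assuming the circular list layout already described. The names mwc_entry, mwc_find_free and full_count are hypothetical; full_count stands in for the suggested variable that counts how often a write had to hold because the cache was full.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cache entry: list links plus the I/O count. */
struct mwc_entry {
    struct mwc_entry *next;
    struct mwc_entry *prev;
    unsigned          io_count;  /* zero means the entry can be reused */
};

/*
 * Start at the LRU entry (anchor->prev) and walk the chain toward the
 * MRU entry looking for an entry with a zero I/O count. Returns NULL,
 * and increments *full_count if given, when the cache is full.
 */
static struct mwc_entry *mwc_find_free(struct mwc_entry *anchor,
                                       unsigned long *full_count)
{
    assert(anchor != NULL);
    struct mwc_entry *lru = anchor->prev;
    struct mwc_entry *e = lru;
    do {
        if (e->io_count == 0)
            return e;
        e = e->prev;
    } while (e != lru);
    if (full_count != NULL)
        (*full_count)++;
    return NULL;
}
```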

If the MWCM is scanning the hold queue when the
cache fills up, the MWCM should continue to scan the
hold queue looking for any requests that may be in the
same LTG as any requests just added from the hold
queue. If any are found they can be removed from the
hold queue and moved to the cache write waiting
queue after incrementing the appropriate I/O count.

As mentioned earlier, there are hash queue
anchors (8) in the volume group structure. In order to
reduce cache search time, the entries are hashed
onto these anchors by the LTG of the request. This
hashing is accomplished by methods commonly
known in the prior art. The entries on any particular
hash queue are forwardly linked via a hash pointer in
part 2 of the cache entry.
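
The hash queues can be sketched as eight singly linked chains. The text does not name the hash function, so the modulo hash below is an assumption, as are the names mwc_entry, mwc_hash, mwc_hash_insert and mwc_hash_find and the use of an LV number alongside the LTG as the lookup key.

```c
#include <assert.h>
#include <stddef.h>

#define MWC_HASH_ANCHORS 8  /* the text states 8 hash queue anchors */

/* Hypothetical cache entry with the forward hash link from "part 2". */
struct mwc_entry {
    struct mwc_entry *hash_next;
    unsigned          lv;   /* LV the LTG belongs to (assumed key part) */
    unsigned long     ltg;  /* logical track group number               */
};

struct mwc_hash {
    struct mwc_entry *anchor[MWC_HASH_ANCHORS];
};

/* Assumed hash: LTG number modulo the number of anchors. */
static unsigned mwc_hash_ltg(unsigned long ltg)
{
    return (unsigned)(ltg % MWC_HASH_ANCHORS);
}

static void mwc_hash_insert(struct mwc_hash *h, struct mwc_entry *e)
{
    assert(h != NULL && e != NULL);
    unsigned b = mwc_hash_ltg(e->ltg);
    e->hash_next = h->anchor[b];
    h->anchor[b] = e;
}

static struct mwc_entry *mwc_hash_find(const struct mwc_hash *h,
                                       unsigned lv, unsigned long ltg)
{
    assert(h != NULL);
    for (struct mwc_entry *e = h->anchor[mwc_hash_ltg(ltg)];
         e != NULL; e = e->hash_next) {
        if (e->lv == lv && e->ltg == ltg)
            return e;
    }
    return NULL;
}
```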

There are certain times when a cache clean up
operation should be done. The most obvious is at LV
close time. At that time the cache should be scanned
for entries with a zero I/O count. When an entry is
found, it should be cleared and moved to the end of
the LRU chain. Once the entire cache has been
scanned, the PVs that this entry belongs to should also be
written. Another time for a cache cleanup operation
might be at the request of system management via an
IOCTL.

One other thing deserves to be mentioned here.
What if the MWCC block on the disk goes bad due to
a media defect? The MWCM will attempt to do a
hardware relocation of that block if this condition is
found. If that relocation fails, or a non-media type
error is encountered on a MWCC write, the PV is
declared missing.

MWCC RECOVERY 

We now know how the front side of the MWCC
works. Remember the whole purpose of the MWCC is
to leave enough bread crumbs lying around so that in
the event of a system crash the mirrored LTGs that
had write requests active can be found and made
consistent. This discussion will focus on the backside of
the MWCC, or the recovery issues.

Recovery will be done only with the initial varyon
operation. This is due to the need to inhibit normal
user I/O in the VG while the recovery operations are
in progress.

The recovery operations are the very last phase
of the VG varyon operation. This is because the entire
VG must be up and configured into the kernel before
any I/O can take place, even in recovery operations
where care must be taken to not allow normal I/O in
the VG until all the LTGs in flight have been made
consistent.

The first step in the recovery process is selecting
the latest MWCC from all the PVs available. Once this
is done, the recovery of the LTGs in the selected
MWCC becomes a simple task of issuing mirror write
consistency recovery (MWCR) I/O requests to the
LVs/LPs/LTGs that have an entry in the cache. This
method is referred to as the fast path method
because the maximum number of recovery
operations is limited to the size of the MWCC. This in effect
sets what the maximum recovery time for the VG is.
In other words, using the selected MWCC, do recovery
on the LTG(s) if the parent LP has more than one
non-stale PP copy.

During these MWCR requests, if a mirror has a
write failure or is on a missing PV it will be marked
stale by the driver, via the WHEEL.
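
The fast path rule, recover an LTG only if its parent LP has more than one non-stale PP copy, can be sketched as below. The layout of mwcc_entry is hypothetical (the on-disk MWCC format is not given here), and the function only counts the MWCR requests that would be issued rather than issuing any I/O.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_MIRRORS 3  /* the PBUF description lists mirrors 0, 1 and 2 */

/* Hypothetical view of one selected MWCC entry joined with LP state. */
struct mwcc_entry {
    int  ncopies;             /* PP copies defined for the parent LP */
    bool stale[MAX_MIRRORS];  /* per-copy stale flag                 */
};

static int non_stale_copies(const struct mwcc_entry *e)
{
    assert(e != NULL && e->ncopies >= 0 && e->ncopies <= MAX_MIRRORS);
    int n = 0;
    for (int i = 0; i < e->ncopies; i++) {
        if (!e->stale[i])
            n++;
    }
    return n;
}

/* Number of MWCR requests the fast path would issue for this cache. */
static size_t fast_path_requests(const struct mwcc_entry *cache, size_t nentries)
{
    size_t issued = 0;
    for (size_t i = 0; i < nentries; i++) {
        if (non_stale_copies(&cache[i]) > 1)
            issued++;
    }
    return issued;
}
```

Because the loop runs once per MWCC entry, the number of recovery operations is bounded by the size of the MWCC, which is the property the text relies on for a bounded recovery time.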

Missing PVs do not add time to the recovery
operation at varyon time, but they may make mirrors stale that
will need to be resynchronized later when the PVs
come back online. There are 3 types of missing PVs.
The first type is previously missing PVs, which are the
PVs that are marked as missing in the VGSA. These
previously missing PVs could have been missing at
the last varyon, or the driver could have found and
declared them missing while the VG was online. It
makes no difference. The second type of missing PVs
is the newly found missing PVs. These PVs were
found by the varyon operation to be not available at
this time, but the PV status in the VGSA indicates they
were online the last time the VG was online. This
could be caused by a drive or adapter failure that
caused a loss of quorum, so that the VG was forcefully
taken offline. Another cause of newly found missing
PVs is when the PV was not included in the list of PVs
to varyon when the varyonvg command was issued.
There is one other way for a PV to fall into the newly
found missing category, and that is when the MWCC
cannot be read due to a read error of any kind, but the
PV does respond to VGDA reads and writes.

The previously missing PVs and the newly found
missing PVs are combined into the final type of
missing PVs, the currently missing PVs. The currently
missing PVs are the ones of importance to the current
discussion. After the first phase of the recovery
scenario, i.e. select a MWCC from the available PVs
and do fast path recovery, the second phase is done.
The second phase is done only if there are any
currently missing PVs.

Actual recovery of LTGs on missing PVs may be
impossible for a couple of reasons. The biggest of
these reasons is that the PV is missing and therefore will
not accept any I/O. Another concern with missing PVs
is when all copies of a LP/LTG are wholly contained
on the missing PVs. This is a problem because there
is no information about these LTGs available to the
recovery process. It is not known if write I/O was in
flight to these LPs/LTGs.

Therefore, it must be assumed there was I/O
outstanding and the recovery process must do the right
thing to ensure data consistency when the PVs are
brought back online. The correct thing to do in this
case is mark all but one non-stale mirror stale, i.e., for
each LP in the VG, if the LP is wholly contained on the
currently missing PVs, then mark all but one of the
mirrors stale. When the PVs come back online the
affected LPs will have to be resynchronized.
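
The second recovery phase can be sketched as follows. The structures lp and lp_copy and the bit-mask representation of the currently missing PVs are assumptions made for illustration; the rule itself, keep one non-stale mirror and mark the rest stale when every copy of the LP lives on a currently missing PV, is the one stated above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_MIRRORS 3

/* Hypothetical description of one LP and where its copies live. */
struct lp_copy {
    int  pv;     /* PV number holding this PP copy (0..31) */
    bool stale;  /* copy already marked stale              */
};

struct lp {
    int            ncopies;
    struct lp_copy copy[MAX_MIRRORS];
};

/* True when every copy of the LP is on a currently missing PV. */
static bool lp_wholly_missing(const struct lp *lp, unsigned long missing_mask)
{
    assert(lp != NULL && lp->ncopies >= 0 && lp->ncopies <= MAX_MIRRORS);
    for (int i = 0; i < lp->ncopies; i++) {
        assert(lp->copy[i].pv >= 0 && lp->copy[i].pv < 32);
        if (!(missing_mask & (1UL << lp->copy[i].pv)))
            return false;
    }
    return lp->ncopies > 0;
}

/* Marks all but one non-stale mirror stale; returns how many were marked. */
static int lp_mark_for_resync(struct lp *lp, unsigned long missing_mask)
{
    if (!lp_wholly_missing(lp, missing_mask))
        return 0;
    bool kept_one = false;
    int marked = 0;
    for (int i = 0; i < lp->ncopies; i++) {
        if (lp->copy[i].stale)
            continue;
        if (!kept_one) {
            kept_one = true;  /* this copy stays the resync source */
            continue;
        }
        lp->copy[i].stale = true;
        marked++;
    }
    return marked;
}
```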

A data storage system has been described which
has an improved system throughput.

A storage hierarchy for managing a data storage
system has also been described.

A method of improved system throughput in a
computer system where multiple copies of data are
stored to aid in error recovery has also been
described.



Claims 


1. A method for managing a plurality of data storage
devices associated with a computer system and
having a first physical volume and subsequent
physical volumes and being partitioned into one
or more logical volumes, each of said logical
volumes being further partitioned into one or more
logical partitions each of which comprises one or
more physical partitions of said storage devices,
said method comprising the steps of:

determining status information for each of
said physical partitions and recording said status
information in a memory of said computer system;

recording said status information in a
status area existing on each of said data storage
devices;

creating updated status information when
a write request is generated for any of said
physical partitions;

updating said status area on said first
physical volume with said updated status
information; and

updating said status area of each
subsequent physical volume within said storage
devices in succession with said updated status
information, wherein if a second or subsequent
write request is received prior to completing an
update of each of said storage device status
areas as a result of a prior write request, said
status information is updated in said computer
memory and used in updating said next
succeeding physical volume status area.

2. A method as claimed in claim 1 wherein each of
said physical partitions corresponding to a given
logical partition contains duplicate data
information.

3. A method as claimed in claim 2 wherein if a
subsequent request to change status information is
received prior to completing an update of each of
said storage device status areas as a result of a
prior status change, said subsequent status
change is recorded while recording status
information resulting from said prior status change.

4. A computer system including means for
managing a plurality of data storage devices associated
with said computer system and having a first
physical volume and subsequent physical
volumes and being partitioned into one or more
logical volumes, each of said logical volumes
being further partitioned into one or more logical
partitions each of which comprises one or more
physical partitions of said data storage devices,
said managing means comprising:

means for maintaining status information
for each of said physical partitions in a memory of
said computer system;

recording means for recording said status
information in a status area existing on each of
said data storage devices;

means for creating updated status
information when a write request is generated for any
of said physical partitions;

first update means for updating said status
area on said first physical volume with said
updated status information; and

subsequent update means for updating
said status area of each subsequent physical
volume within said data storage devices in
succession with said updated status information,
wherein if a second or subsequent write request
is received prior to completing an update of each
of said data storage device status areas as a
result of a prior write request, said status
information is updated in said computer memory and
used in updating said next succeeding physical
volume status area.



FIG. 1 [drawing; recoverable labels: APPLICATIONS, APPLICATION DEVELOPMENT PRODUCTS, KERNEL, PROCESSOR, MEMORY MANAGER UNIT, SYSTEM RAM, I/O CHANNEL CONTROLLER, DISK DRIVE, OTHER I/O]



FIG. 2 [drawing of a file tree; recoverable labels: / (ROOT), 00A, 00A1, 00A2, 00A21, 00A22, 00B, 00C, 00D, 00D1, 00D2, 00D3, 00D4, 00D41, 00D42, 00E; note on drawing: DIRECTORY FILES ARE NOT UNDERLINED]

FIG. 3 [drawing; no text recoverable]

FIG. 4 [drawing; recoverable labels: DISK, SURFACE 1, SURFACE 2, TRACK 0, TRACK 1, TRACK 2, TRACK 3, TRACK N, SECTOR 1, SECTOR 2, SECTOR N, BYTE 1, BYTE 2, BYTE N]



FIG. 5 [drawing; recoverable labels: RESERVED AREA, LOGICAL VOLUME MGR AREA, USER AREA, BAD BLOCK POOL]



FIG. 6 [drawing; recoverable labels: Logical Volume Mgr Area, 8 blks, 64 blks, VGSA, VGDA, vgsa, vgda]

VGSA = Volume Group Status Area (Primary Copy)
VGDA = Volume Group Data Area (Primary Copy)
vgsa = Volume Group Status Area (Secondary Copy)
vgda = Volume Group Data Area (Secondary Copy)




FIG. 9 [drawing; recoverable labels: 000, 008, 010, 018, 020, 028, 030, 038, LP0, LP1]



FIG. 13 [drawing; recoverable labels: File System I/O Requests, Logical Volume Manager Pseudo Device Driver, Strategy Layer, Scheduler Layer, Physical Layer, Volume Group, Physical Volumes]



FIG. 16 [drawing; recoverable label: The WHEEL]



PBUF Structure 
Flags - Basic information - READ/WRITE, Buffer Busy, Error
Indicator.

Pointers - Used to link these PBUFs onto various chains to
control the flow.

IODONE - PTR to a function to hand the PBUF to when the request is
complete, i.e. the lower level disk device drivers call the function
pointed to by this field to return the request back to LVM when
they are finished with the request.

Device - Physical Device where the transfer will be done. 

Disk block # - Disk Block # where the transfer is to start from. 

Memory Address - Memory Address where the data is to be 
transferred to or from. 

Xfercount - Number of bytes to transfer. In the case of LVM this 
must be a multiple of disk blocks (512 bytes). 

Error Type - When the error indicator is on (True) in the flags
field, this field indicates the type of error. Example: Media Error,
Invalid Request ...

Residual Xfercount - If an error occurred on a Xfer, this field
contains the number of bytes that were NOT transferred.

PTR to Original Request - LVM receives requests from layers
above the strategy layer. These logical requests are translated
into one or more physical requests (PBUFs). When all physical
requests for a given logical request are complete the logical
request can be returned to its originator. This is a backward link
to the originating logical request.

FIG. 17a

PTR to Scheduling Routine - Physical requests are returned from
the disk drivers, via the IODONE field, to the physical
layer of LVM. The physical layer has responsibility for bad block
processing. If the request is finished the physical layer will return
the request to the scheduling layer via this pointer. The
scheduling layer makes decisions concerning what must be done
next to complete the logical request.

Mirror - Mirror number associated with this PBUF: 0, 1, or 2.

Mirror Avoid - Bit mask (3 bits) indicating which mirrors are to be
avoided or not used to satisfy the logical request, i.e. the mirror is
broken, or on a physical volume that is not available.

Mirror Bad - Bit mask (3 bits) that indicates which mirrors have had
failures or are broken.

Mirror Done - Bit mask (3 bits) that indicates which mirrors have
completed the transfer.

SW Retry - Software Retry count; how many times this block has
had a software relocation attempted, 1 or 2.

Type - Type of PBUF, used in error processing by the WHEEL
and Bad Block Processing. Tells the WHEEL if this is a Make
Stale PP Request, Mark PV Missing, or Make PP Fresh.

Bad Block Operation - Used to control updating the bad block
directory that resides in the reserved area of all physical volumes.

WHEEL Stop - The position at which this PBUF is to get off of the
WHEEL when it is on the WHEEL.



FIG. 17b 
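
The field list of FIGS. 17a and 17b can be pictured as a C structure. This is a sketch only: the structure name pbuf_sketch, every member name and every member type are assumptions, since the figures give descriptions rather than declarations. The two helpers illustrate the 512-byte transfer rule and the 3-bit Mirror Avoid mask.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PBUF_DISK_BLOCK 512  /* disk block size stated for Xfercount */
#define PBUF_MIRRORS    3    /* mirrors are numbered 0, 1 and 2      */

/* Hypothetical layout mirroring the PBUF field descriptions. */
struct pbuf_sketch {
    unsigned            flags;         /* READ/WRITE, busy, error indicator  */
    struct pbuf_sketch *chain;         /* link onto the various chains       */
    void              (*iodone)(struct pbuf_sketch *);  /* completion hook   */
    int                 device;        /* physical device for the transfer   */
    long                blkno;         /* starting disk block number         */
    void               *addr;          /* memory address of the data         */
    size_t              xfercount;     /* bytes to transfer                  */
    int                 error_type;    /* valid when the error flag is on    */
    size_t              resid;         /* bytes NOT transferred on error     */
    void               *orig_request;  /* originating logical request        */
    void              (*sched)(struct pbuf_sketch *);   /* scheduling layer  */
    uint8_t             mirror;        /* mirror number of this PBUF         */
    uint8_t             mirror_avoid;  /* 3-bit mask: mirrors to avoid       */
    uint8_t             mirror_bad;    /* 3-bit mask: mirrors with failures  */
    uint8_t             mirror_done;   /* 3-bit mask: mirrors completed      */
    uint8_t             sw_retry;      /* software relocation attempts       */
    uint8_t             type;          /* PBUF type used by the WHEEL        */
    uint8_t             bb_op;         /* bad block directory operation      */
    uint8_t             wheel_stop;    /* position to leave the WHEEL        */
};

/* Xfercount must be a multiple of the disk block size. */
static int pbuf_xfer_valid(const struct pbuf_sketch *pb)
{
    assert(pb != NULL);
    return pb->xfercount % PBUF_DISK_BLOCK == 0;
}

/* Lowest-numbered mirror not in the avoid mask, or -1 if none. */
static int pbuf_first_usable_mirror(const struct pbuf_sketch *pb)
{
    assert(pb != NULL);
    for (int m = 0; m < PBUF_MIRRORS; m++) {
        if (!((pb->mirror_avoid >> m) & 1))
            return m;
    }
    return -1;
}
```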
/* VGSA.H */

#ifndef _H_VGSA
#define _H_VGSA

/*
 * COMPONENT_NAME: (SYSXLVM) Logical Volume Manager Device Driver - vgsa.h
 */

#include <sys/param.h>
#include <sys/dasd.h>

/*
 * LVDD internal macros and defines used by Volume Group Status
 * Area (VGSA) logic
 */

#define RTN_ERR     1   /* Return requests from the VGSA */
                        /* wheel with ENXIO errors       */
#define RTN_NORM    0   /* Return requests from the VGSA */
                        /* wheel without explicitly      */
                        /* turning on the B_ERROR flag   */

#define VGSA_BLK    8                    /* VGSA length in disk blocks */
#define VGSA_SIZE   (VGSA_BLK * DBSIZE)  /* VGSA length in bytes       */
#define VGSA_BT_PV  127                  /* VGSA bytes per PV          */

/*
 * This structure limits the number of Physical Partitions (PP) that can be
 * present in the VG to 32,512. The stalepp portion is divided equally
 * between the 32 possible PVs of the VG. This gives each PV 127 bytes
 * or 1016 PPs.
 */

FIG. 21-1
struct vgsa_area {
    struct timestruc_t b_tmstamp;   /* Beginning time stamp */
                                    /* Bit per PV */
    ulong pv_missing[(MAXPVS + (NBPL - 1)) / NBPL];
                                    /* Stale PP bits */
    uchar stalepp[MAXPVS][VGSA_BT_PV];
    char  pad2[12];                 /* Padding */
    struct timestruc_t e_tmstamp;   /* Ending time stamp */
};

/*
 * Macros used to set/clear/test pv_missing and stalepp bits in a vgsa_area
 * struct. The ptr argument is assumed to be a ptr to the vgsa_area
 * structure. All other arguments are assumed to be zero relative.
 * This allows LVM library functions to use these macros.
 * NOTE these macros will not work if the max number of PVs per VG is
 * greater than 32.
 */

#define SETSA_PVMISS(Ptr, Pvnum) \
    ((Ptr)->pv_missing[(Pvnum)/NBPL] |= (1 << (Pvnum)))
#define CLRSA_PVMISS(Ptr, Pvnum) \
    ((Ptr)->pv_missing[(Pvnum)/NBPL] &= (~(1 << (Pvnum))))
#define TSTSA_PVMISS(Ptr, Pvnum) \
    ((Ptr)->pv_missing[(Pvnum)/NBPL] & (1 << (Pvnum)))

#define SETSA_STLPP(Ptr, Pvnum, Pp) \
    ((Ptr)->stalepp[(Pvnum)][(Pp)/NBPB] |= (1 << ((Pp) % NBPB)))

#define CLRSA_STLPP(Ptr, Pvnum, Pp) \
    ((Ptr)->stalepp[(Pvnum)][(Pp)/NBPB] &= (~(1 << ((Pp) % NBPB))))

#define XORSA_STLPP(Ptr, Pvnum, Pp) \
    ((Ptr)->stalepp[(Pvnum)][(Pp)/NBPB] ^= (1 << ((Pp) % NBPB)))

FIG. 21-2
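
A self-contained check of the stale PP bit arithmetic, offered as a sketch. The macro bodies follow the listing; the values NBPB = 8 and MAXPVS = 32 are assumptions taken from the header's own comments (in the original they come from system headers), and stale_map and count_stale are hypothetical names introduced here. PP number Pp lands in byte Pp / NBPB, bit Pp % NBPB of the PV's 127-byte row.

```c
#include <assert.h>
#include <stddef.h>

#define NBPB        8    /* bits per byte (assumed; normally a system define) */
#define MAXPVS      32   /* PVs per VG, per the comment in vgsa.h             */
#define VGSA_BT_PV  127  /* VGSA bytes per PV                                 */

typedef unsigned char uchar;

/* Only the stalepp portion of the VGSA is modeled here. */
struct stale_map {
    uchar stalepp[MAXPVS][VGSA_BT_PV];
};

#define SETSA_STLPP(Ptr, Pvnum, Pp) \
    ((Ptr)->stalepp[(Pvnum)][(Pp) / NBPB] |= (1 << ((Pp) % NBPB)))
#define CLRSA_STLPP(Ptr, Pvnum, Pp) \
    ((Ptr)->stalepp[(Pvnum)][(Pp) / NBPB] &= ~(1 << ((Pp) % NBPB)))
#define TSTSA_STLPP(Ptr, Pvnum, Pp) \
    ((Ptr)->stalepp[(Pvnum)][(Pp) / NBPB] & (1 << ((Pp) % NBPB)))

/* Count the stale PPs recorded for one PV. */
static int count_stale(const struct stale_map *m, int pvnum)
{
    assert(m != NULL && pvnum >= 0 && pvnum < MAXPVS);
    int n = 0;
    for (int pp = 0; pp < VGSA_BT_PV * NBPB; pp++) {
        if (TSTSA_STLPP(m, pvnum, pp))
            n++;
    }
    return n;
}
```

The limits quoted in the header comment follow directly: 127 bytes times 8 bits gives 1016 PPs per PV, and 32 PVs times 1016 gives 32,512 PPs per VG.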
#define TSTSA_STLPP(Ptr, Pvnum, Pp) \
    ((Ptr)->stalepp[(Pvnum)][(Pp)/NBPB] & (1 << ((Pp) % NBPB)))

/*
 * Macros used to set/retrieve the logical sector number and sequence number
 * associated with each VGSA.
 */

#define GETSA_LSN(Vg, Idx) \
    ((Vg)->pvols[(Idx) >> 1]->sa_area[(Idx) & 1].lsn)

#define SETSA_LSN(Vg, Idx, Newlsn) \
    ((Vg)->pvols[(Idx) >> 1]->sa_area[(Idx) & 1].lsn = (Newlsn))

#define GETSA_SEQ(Vg, Idx) \
    ((Vg)->pvols[(Idx) >> 1]->sa_area[(Idx) & 1].sa_seq_num)

#define SETSA_SEQ(Vg, Idx, Seq) \
    ((Vg)->pvols[(Idx) >> 1]->sa_area[(Idx) & 1].sa_seq_num = (Seq))

#define NUKESA(Vg, Idx) \
    ((Vg)->pvols[(Idx) >> 1]->sa_area[(Idx) & 1].nukesa)

#define SET_NUKESA(Vg, Idx, Flag) \
    ((Vg)->pvols[(Idx) >> 1]->sa_area[(Idx) & 1].nukesa = (Flag))

/*
 * The following structures are used by the config routines to pass
 * information to the hd_sa_config() function for stale/fresh PP and
 * install/delete PV processing. A pointer to an array of these
 * structures is passed as an argument.
 */

/*
 * An array of these structures is terminated with both the pvnum and pp
 * equalling -1.
 */

FIG. 21-3
struct cnfg_pp_state {
    short pvnum;     /* PV number the PP is on */
    short pp;        /* PP number to mark stale/fresh */
    int   ppstate;   /* state to mark PP stale/fresh */
};

/*
 * passed in as arg when a CNFG_EXT request is done.
 */

struct sa_ext {
    struct lvol *klv_ptr;         /* ptr to lvol struct being extended */
    short nparts;                 /* number of copies of the lv */
    char  isched;                 /* scheduling policy for the lv */
    char  res;                    /* padding */
    ulong nblocks;                /* length in blocks of the lv */
    struct part **new_parts;      /* ptr to new part struct list */
    int   old_numlps;             /* old number of logical partitions on lv */
    int   old_nparts;             /* previous number of partitions on lv */
    int   error;                  /* error to return to library layer */
    struct cnfg_pp_state *vgsa;   /* ptr to pp info structure */
};

/*
 * passed in as arg when a CNFG_RED request is done
 */

struct sa_red {
    struct lvol *lv;              /* ptr to lvol struct being reduced */
    short nparts;                 /* number of copies of the lv */
    char  isched;                 /* scheduling policy for the lv */
    char  res;                    /* reserved area */
    ulong nblocks;                /* length in blocks of the lv */
    struct part **newparts;       /* ptr to new part struct list */
    unsigned short min_num;       /* minor number of logical volume */
    int   numlps;                 /* number of lps on lv after reduction */
    int   numred;                 /* number of pps being reduced */

FIG. 21-4
    int   error;                  /* error to return to library layer */
    struct extred_part *list;     /* list of pps to reduce */
};

/*
 * install PV information for VGSA config routine
 */

struct confg_pv_ins {
    struct pvol *pvol;    /* PV to install or remove */
    short qrmcnt;         /* new VG quorum count */
    short pv_idx;         /* index into vg's pvol array */
};

/*
 * delete PV information for VGSA config routine. Also used for remove PV
 * and missing PV (qrmcnt will not be used for missing PV)
 */

struct cnfg_pv_del {
    struct pvol *pv_ptr;  /* pointer to pvol struct to remove */
    struct part *lp_ptr;  /* pointer to DALV's LP struct to zero */
    short lpsize;         /* size of DALV's LP */
    short qrmcnt;         /* VG's new quorum cnt once this PV is deleted */
};

/*
 * information to add/delete a VGSA from a PV
 */

struct confg_pv_vgsa {
    struct pvol *pv_ptr;  /* pointer to pvol struct to remove */
    daddr_t sa_lsns[2];   /* LSNs for VGSAs added or 0 if deleted or */
                          /* if a copy not being added               */
    short qrmcnt;         /* VG's new quorum cnt once this PV is deleted */
};

/*
 * The following defines are used by the VGSA write operations. These
 * defines indicate what action the pbuf is requesting. It is stored
 * in the pb_type field of the pbuf.
 */

#define SA_PVMISSING  1   /* PV missing type pbuf */
#define SA_STALEPP    2   /* Stale PP type pbuf */
#define SA_FRESHPP    3   /* Fresh PP type pbuf */
#define SA_CONFIGOP   4   /* hd_config function type pbuf */
#define SA_PVREMOVED  5   /* PV removed type pbuf */

/*
 * The following defines are used by the config routines to set up
 * the cnfg_pp_state fields
 */

#define STALEPP       1
#define FRESHPP       0
#define CNFG_STOP     -1
#define CNFG_NEWCOPY  -1

#endif /* _H_VGSA */



/* LIBLVM.H
 * COMPONENT_NAME: (liblvm) Logical Volume Manager
 * © COPYRIGHT International Business Machines Corp. 1988, 1990
 * All Rights Reserved
 */

#ifndef _H_LIBLVM
#define _H_LIBLVM

#include <lvm.h>
#include <sys/dasd.h>
#include <sys/bootrecord.h>

#ifndef TRUE
#define TRUE 1
#endif

#ifndef FALSE
#define FALSE 0
#endif

#ifndef NULL
#define NULL ((void *)0)
#endif

/*
 * Error codes used internally by the library. These are not returned
 * to the user. NOTE that these values start at 500 so they will not
 * conflict with the error values in lvm.h which are returned to the
 * user.
 */

#define LVM_BBRDERR    -500   /* read error on bad block directory */
#define LVM_BBWRERR    -501   /* write error on bad block directory */
#define LVM_PVRDRELOC  -502   /* put PV in read only relocation */
#define LVM_BBINSANE   -503   /* bad block directory is not sane */

/*
 * General defines
 */

#define LOCK_ALL     0
#define CHECK_MAJ    1
#define NOCHECK      0
#define FIRST_INDEX  0
#define SEC_INDEX    1
#define THIRD_INDEX  2
#define NO_COPIES    0
#define ONE_COPY     1
#define LVMJNAME     72
#define LVM_NOLPSYET  0
#define LVM_REDUCE    1
#define LVM_EXTEND    2
#define LVM_FIRST     1
#define LVM_SEC       2
#define LVM_THIRD     3
#define LVM_LASTPV    0
#define LVM_CASEGEN   1
#define LVM_CASE2TO1  2
#define LVM_CASE3TO2  3
#define LVM_GETSTALE  1
#define LVM_NOSTALE   2
#define LVM_PVNAME    1
#define LVM_VGNAME    2
#define LVM_LVDDNAME  "hd_pin"
#define LVM_KMIDFILE  "/etc/vg/lvdd.kmid"
#define LVM_STREAMRD  "r"
#define LVM_STREAMWR  "w"
#define MAPNOTOPEN    -1    /* Mapped file is not open */

/*
 * GENERAL LV VALUES
 */

#define LVM_INITIAL_LPNUM  0

#define LVM_LVMID     0x5F4C564D   /* LVM id field = "_LVM" */
#define LVM_SLASH     0x2F         /* hex value for ASCII slash */
#define LVM_NULLCHAR  '\0'         /* null character */
#define LVM_DEV       "/dev/"      /* concatenate to device names */

/*
#define LVM_EXTNAME (sizeof (LVM_NAMESIZ) + sizeof (LVM_DEV) + 1)
*/

/* size of extended device names */
#define LVM_EXTNAME   72
#define LVM_ETCVG        "/etc/vg/vg"
                         /* concatenate to VG id for map filename */

#define LVM_RELOC_LEN    256   /* length in blocks of BB reloc pool */
#define LVM_RELOC_MSK    0x8   /* mask value to check BB relocation */
#define LVM_FROMBEGIN    [value illegible]   /* seek value is offset from beginning */
#define LVM_NOVGDALSN    -1    /* no desc LSN defined for this entry */
#define LVM_FILENOTOPN   -1    /* file is not currently open */
#define LVM_WRITEDA      1     /* write VGDA for this PV */
#define LVM_DALVMINOR    0     /* minor number of descriptor area LV */
#define LVM_PRMRY        0     /* index for primary VGDA/VGSA LSN */
#define LVM_SCNDRY       1     /* index for secondary VGDA/VGSA LSN */
#define LVM_[name illegible]  0664   /* permissions for open of mapped file */
#define LVM_1STPV        1     /* PV number of first physical volume */
#define LVM_[name illegible]  0      /* write order of beginning/middle/end */
#define LVM_MIDBEGEND    1     /* write order of middle/beginning/end */
#define LVM_ZEROETS      2     /* zero end timestamp, then write b/m/e */
#define LVM_GREATER      1     /* first timestamp greater than second */
#define LVM_EQUAL        2     /* first timestamp equals second */
#define LVM_LESS         3     /* first timestamp less than second */
#define LVM_TSRDERR      [value illegible]   /* read error on timestamp */
#define LVM_BTSEQETS     LVM_EQUAL     /* begin timestamp = end ts */
#define LVM_BTSGTETS     LVM_GREATER   /* begin timestamp > end ts */
#define LVM_DAPVS_TTL1   1     /* total of 1 PV with VGDA copies */
#define LVM_TTLDAS_1PV   2     /* total number VGDA copies on 1 PV */
#define LVM_DAPVS_TTL2   2     /* total of 2 PVs with VGDA copies */
#define LVM_TTLDAS_2PV   3     /* total number VGDA copies on 2 PVs */
#define LVM_DASPERPVGEN  1     /* number VGDAs per PV for general case */
#define LVM_BBCHGRBLK    1     /* change relocation block of bad block */
#define LVM_BBCHGSTAT    2     /* change status field of bad block */
#define LVM_STRCMPEQ     0     /* string compare result of equal */
#define LVM_BBRDONLY     1     /* read a bad block directory */
#define LVM_BBRDINIT     2     /* read and initialize a bad block directory */
#define LVM_BBRDRECV     3     /* read and recover bad block directories */
#define LVM_BBPRIM       1     /* use the primary bad block directory */
#define LVM_BBBACK       2     /* use the backup bad block directory */

/*
 * Macros
 */
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#defineLVMJIZrOBBND(Size) ((((Size) + DBSIZE-1)/ DBSIZE) ' DBSIZE) 

#define LVMJBDIRLEN(BbJdr) (LVM_SIZTOBBND(sizeof (struct bbjdr) +\ 
(Bbjidr->num_entries * sizeof (struct bb_entry)))) 

#defineLVM_MAPFN(Mapfn,Vgid) \ 

(sprintf ((Mapfn), "%s%8.8X%8.8X', LVM_ETCVG. \ 
(Vgid) -> word1 , (Vgid) -> word2)) 
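
As a rough illustration of what LVM_MAPFN produces, the sketch below formats a map file name from a two-word volume group id. The prefix string, the structure layout and the function name are assumptions made for this example; the listing itself only shows that a prefix (LVM_ETCVG) is followed by both id words, each printed as eight uppercase hexadecimal digits.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical stand-in for struct unique_id: only the two words that
   the macro reads are modeled here. */
struct vg_id_words {
    unsigned int word1;
    unsigned int word2;
};

/* Assumed prefix; the real value of LVM_ETCVG is not shown in this
   part of the listing. */
#define EXAMPLE_ETCVG "/etc/vg/vg"

/* Write "<prefix><word1 as 8 hex digits><word2 as 8 hex digits>" into
   buf and return the number of characters the full name needs. */
static int build_map_name(char *buf, size_t buflen,
                          const struct vg_id_words *vgid)
{
    return snprintf(buf, buflen, "%s%8.8X%8.8X",
                    EXAMPLE_ETCVG, vgid->word1, vgid->word2);
}
```

Using snprintf rather than sprintf is a choice made for the sketch so that an undersized buffer is truncated instead of overrun.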

#define LVM_BUILDNAME(name,maj,min) \

(sprirltf((name);%s%c%c%s%d%c%d^LVM_DEV^^^^ 

#define LVM_BUILDVGNAME(name,maj) \ 

(sprirTlf({name);%s%c%c%s%d:LVM_DEV;j;j;vg\(mcy 

#define LVM_PPLENBLKS(Ppsize) (1 << ((Ppsize) - DBSHIFT))

#define LVM_PSNFSTPP(Lvmareastart, Lvmarealen) \
    (TRK2BLK (BLK2TRK (Lvmareastart + Lvmarealen - 1) + 1))
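
The two size macros can be checked with small numbers. The following is a minimal sketch, assuming 512-byte disk blocks (the prototype comments in this listing speak of "512 byte blocks") and therefore DBSHIFT equal to 9; the wrapper function names are hypothetical and exist only so the results are easy to inspect.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values: 512-byte disk blocks, so DBSHIFT is log2(DBSIZE). */
#define DBSIZE  512
#define DBSHIFT 9

/* Round a byte count up to the next multiple of DBSIZE. */
#define LVM_SIZTOBBND(Size) ((((Size) + DBSIZE - 1) / DBSIZE) * DBSIZE)

/* Blocks per physical partition, where Ppsize is log2 of the partition
   size in bytes. */
#define LVM_PPLENBLKS(Ppsize) (1 << ((Ppsize) - DBSHIFT))

/* Hypothetical wrapper: bytes rounded up to a block boundary. */
static size_t round_to_block(size_t nbytes)
{
    return LVM_SIZTOBBND(nbytes);
}

/* Hypothetical wrapper: partition length in blocks. */
static long pp_len_blocks(int ppsize)
{
    /* Keep the shift count within the range that is defined for int. */
    assert(ppsize >= DBSHIFT && ppsize - DBSHIFT < 31);
    return LVM_PPLENBLKS(ppsize);
}
```

With these assumptions a 513-byte directory occupies 1024 bytes on disk, and a partition size exponent of 22 (4 MB) corresponds to 8192 blocks.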

/*
 * The following is the file header structure that gives indexes and
 * general information about the volume group descriptor area structures
 */

struct da_info {
    daddr_t dalsn;              /* logical sector number of VGDA copy */
    struct timestruc_t ts;      /* timestamp of this VGDA copy */
};

struct fheader {
    long vginx;                 /* byte offset for vg header */
    long lvinx;                 /* byte offset for lv entries */
    long pvinx;                 /* byte offset for pv entries */
    long endpvs;                /* byte offset for end of last PV entry */
    long name_inx;              /* offset for the name area */
    long trailinx;              /* byte offset for the vg trailer */

FIG. 21-10

    long major_num;             /* major number of volume group */
    long vgda_len;              /* length in blocks of the VGDA */
    char vgname[LVM_NAMESIZ];   /* name of volume group */
    long quorum_cnt;            /* number of vgdas needed for varyon */
    long pad1;                  /* pad */
    short int num_desclps;      /* number of LPs per PV for the VGDA LV */
    struct pvinfo {
        char pvname[LVM_NAMESIZ];           /* PV name */
        struct unique_id pv_id;             /* id of physical volume */
        long pvinx;                         /* byte offset to PV header */
        dev_t device;                       /* major/minor number */
        short pad2;                         /* pad */
        short pad3;                         /* pad */
        struct da_info da[LVM_PVMAXVGDAS];  /* info on this VGDA copy */
    } pvinfo[LVM_MAXPVS];       /* information about each PV */
};



/* II. Volume Group Descriptor Area */

struct vg_header {
    struct timestruc_t vg_timestamp;    /* time of last update */
    struct unique_id vg_id;     /* unique id for volume group */
    short numlvs;               /* number of lvs in vg */
    short maxlvs;               /* max number of lvs allowed in vg */
    short pp_size;              /* size of pps in the vg */
    short numpvs;               /* number of pvs in the vg */
    short total_vgdas;          /* number of copies of vg */
                                /* descriptor area on disk */
    short vgda_size;            /* size of volume group descriptor */
                                /* area */
};

FIG. 21-11

struct lv_entries {
    short lvname;               /* name of LV */
    short res1;                 /* reserved area */
    long maxsize;               /* maximum number of partitions allowed */
    char lv_state;              /* state of logical volume */
    char mirror;                /* none, single, or double */
    short mirror_policy;        /* type of writing used to write */
    long num_lps;               /* number of logical partitions on the lv */
                                /* base 1 */
    char permissions;           /* read write or read only */
    char bb_relocation;         /* specifies if bad block */
                                /* relocation is desired */
    char write_verify;          /* verify all writes to the LV */
    char mirwrt_consist;        /* mirror write consistency flag */
    long res3;                  /* reserved area on disk */
    double res4;                /* reserved area on disk */
};

struct pv_header {
    struct unique_id pv_id;     /* unique identifier of PV */
    unsigned short pp_count;    /* number of physical partitions */
                                /* on PV */
    char pv_state;              /* state of physical volume */
    char res1;                  /* reserved area on disk */
    daddr_t psn_part1;          /* physical sector number of 1st pp */
    short pvnum_vgdas;          /* number of vg descriptor areas */
                                /* on the physical volume */
    short pv_num;               /* PV number */
    long res2;                  /* reserved area on disk */
};

struct pp_entries {

FIG. 21-12

    short lv_index;             /* index to lv pp is on */
    short res_1;                /* reserved area on disk */
    long lp_num;                /* log. part. number */
    char copy;                  /* the copy of the logical partition */
                                /* that this pp is allocated for */
    char pp_state;              /* current state of pp */
    char fst_alt_vol;           /* pv where partition allocation for */
                                /* first mirror begins */
    char snd_alt_vol;           /* pv where partition allocation for */
                                /* second mirror begins */
    short fst_alt_part;         /* partition to begin first mirror */
    short snd_alt_part;         /* partition to begin second mirror */
    double res_3;               /* reserved area on disk */
    double res_4;               /* reserved area on disk */
};

struct namelist {
    char name[LVM_MAXLVS][LVM_NAMESIZ];
};

struct vg_trailer {
    struct timestruc_t timestamp;   /* time of last update */
    double res_1;               /* reserved area on disk */
    double res_2;               /* reserved area on disk */
    double res_3;               /* reserved area on disk */
};



/*
 * The following structures are used in lvm_varyonvg
 */

struct da_sa_info

FIG. 21-13

    /* structure to contain timestamp information about a
       volume group descriptor or status area */
{
    struct timestruc_t ts_beg;
      /* beginning timestamp value */
    struct timestruc_t ts_end;
      /* ending timestamp value */
    short int ts_status;
      /* indicates if read error on either timestamp, or if both good
         indicates if beginning ts equal or greater than ending ts */
    short int wrt_order;
      /* indicates order in which to write this VGDA or VGSA copy */
    short int wrt_status;
      /* indicates whether this VGDA or VGSA copy is to be written */
};
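
The ts_status field records how the beginning and ending timestamps of one copy compare. The sketch below shows one way such a comparison could be made. The timestamp layout (seconds plus nanoseconds), the function names, and the rule that a copy is only treated as consistent when both timestamps are equal are assumptions for illustration; the result codes reuse the values 1, 2 and 3 given for LVM_GREATER, LVM_EQUAL and LVM_LESS in the constants above.

```c
#include <assert.h>

/* Hypothetical stand-in for struct timestruc_t. */
struct ts {
    long tv_sec;
    long tv_nsec;
};

/* Comparison results, using the values from the constants table. */
enum { TS_GREATER = 1, TS_EQUAL = 2, TS_LESS = 3 };

/* Compare two timestamps, seconds first and then nanoseconds. */
static int ts_compare(const struct ts *a, const struct ts *b)
{
    assert(a != 0 && b != 0);
    if (a->tv_sec != b->tv_sec)
        return a->tv_sec > b->tv_sec ? TS_GREATER : TS_LESS;
    if (a->tv_nsec != b->tv_nsec)
        return a->tv_nsec > b->tv_nsec ? TS_GREATER : TS_LESS;
    return TS_EQUAL;
}

/* Assumption: a copy whose beginning and ending timestamps differ was
   caught in the middle of an update and needs to be rewritten. */
static int copy_is_consistent(const struct ts *beg, const struct ts *end)
{
    return ts_compare(beg, end) == TS_EQUAL;
}
```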

struct inpvs_info
    /* information structure for PVs in the user's input list */
{
    struct
    {
        int fd;
          /* file descriptor for open of physical disk */
        struct unique_id pv_id;
          /* the unique id for this physical volume */
        dev_t device;
          /* the major/minor number of the physical volume */
        daddr_t da_psn [LVM_PVMAXVGDAS];
          /* physical sector number (PSN) of beginning of the
             volume group descriptor area (primary and secondary
             copies), or 0 if none */
        daddr_t sa_psn [LVM_PVMAXVGDAS];
          /* PSN of beginning of the volume group status area
             (primary and secondary copies), or 0 if none */
        daddr_t reloc_psn;
          /* PSN of the beginning of the bad block relocation
             pool */
        long reloc_len;
          /* the length in blocks of the bad block relocation
             pool */
        short int pv_num;
          /* the number of the physical volume */
        short int pv_status;
          /* status of the physical volume */
#define LVM_NOTVLDPV 0  /* non valid physical volume */
#define LVM_VALIDPV  1  /* valid physical volume */
        struct da_sa_info da [LVM_PVMAXVGDAS];
          /* array of structures to contain timestamp information
             about VGDAs on one PV */
        short int index_newestda;
          /* index of VGDA copy on PV which has newest timestamp */
        short int index_nextda;
          /* index of VGDA copy on PV which is next written */
    } pv [LVM_MAXPVS];
      /* array of physical volumes, indexed by order in input
         parameter list */
    long lvmarea_len;
      /* the length of the entire LVM reserved area on disk */
    long vgda_len;
      /* length of the volume group descriptor area */
    long vgsa_len;
      /* length of the volume group status area */
    short int num_desclps;
      /* the number of logical partitions per PV needed for the
         descriptor / status area logical volume */
    short int pp_size;
      /* the size of a physical partition for this volume group */
};

struct mwc_info
    /* structure to contain timestamp information about a mirror
       write cache area */
{
    struct timestruc_t ts;
      /* timestamp value */
    short int good_mwcc;
      /* flag which indicates if the MWCC could not be read */
    short int wrt_status;
      /* indicates whether this MWCC is to be written */
};

struct defpvs_info
    /* information structure for PVs defined into the kernel */
{
    struct
    {
        short int in_index;
          /* corresponding index into the input PV information
             structure for this PV */
        short int pv_status;
          /* indicates if this PV is defined into the kernel */
#define LVM_NOTDFND 0   /* this PV not defined in kernel */
#define LVM_DEFINED 1   /* this PV defined in kernel */
        struct da_sa_info sa [LVM_PVMAXVGDAS];
          /* array of structures to contain timestamp information
             about VGSAs on one PV */
        struct mwc_info mwc;
          /* structure to contain information about the mirror
             write consistency cache on this PV */
    } pv [LVM_MAXPVS];
      /* array of physical volumes indexed by PV number */
    int total_vgdas;
      /* total number of volume group descriptor/status areas */
    struct timestruc_t newest_dats;
      /* newest good timestamp for the volume group descriptor area */
    struct timestruc_t newest_sats;
      /* newest good timestamp for the volume group status area */
    struct timestruc_t newest_mwcts;
      /* timestamp for newest mirror write consistency cache */
};

/*
 * Function declarations
 */

#ifndef _NO_PROTO

/*
 * bbdirutl.c
 */

int lvm_bbdsane (
    char * buf);
      /* buffer containing the directory to check */

int lvm_getbbdir (
    int pv_fd,
      /* the file descriptor for this physical volume device */
    char * buf,
      /* a buffer into which the bad block directory will be read */
    int dir_flg);
      /* flags to indicate which directory to read */

int lvm_rdbbdir (
    int pv_fd,
      /* the file descriptor for this physical volume device */
    char * buf,
      /* a buffer into which the bad block directory will be read */
    int act_flg);
      /* flag to indicate type of action requested */

int lvm_wrbbdir (
    int pv_fd,
      /* the file descriptor for this physical volume device */
    char * bbdir_buf,
      /* a buffer containing the bad block directory */
    int dir_flg);
      /* flags to indicate which directory to write */



/*
 * bblstutl.c
 */

void lvm_addbb (
    struct bad_blk ** head_ptr,
      /* a pointer to the pointer to the head of the bad block linked list */
    struct bad_blk * bb_ptr);
      /* pointer to the bad block structure which is to be added to the
         list */
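
lvm_addbb receives a pointer to the list head pointer, which lets the routine replace the head itself when the new entry belongs at the front. The following is a minimal sketch of that pattern only. The node layout, the function name, and the choice to keep the list in ascending block order are assumptions; the real struct bad_blk and the ordering used by the routine are not shown in this listing.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node layout; the real struct bad_blk is not shown here. */
struct bad_blk_node {
    long blkno;                     /* bad block number */
    long relblk;                    /* block it was relocated to */
    struct bad_blk_node *next;
};

/* Insert bb so that the list stays in ascending blkno order (the
   ordering is an assumption made for this sketch). Walking with a
   pointer to the link means the head needs no special case. */
static void add_bad_block(struct bad_blk_node **head_ptr,
                          struct bad_blk_node *bb)
{
    assert(head_ptr != NULL && bb != NULL);
    while (*head_ptr != NULL && (*head_ptr)->blkno < bb->blkno)
        head_ptr = &(*head_ptr)->next;
    bb->next = *head_ptr;
    *head_ptr = bb;
}
```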
int lvm_bldbblst (
    int pv_fd,
      /* the file descriptor for this physical volume device */
    struct pvol * pvol_ptr,
      /* a pointer to a structure which describes a physical volume for the
         logical volume device driver (LVDD) */
    daddr_t reloc_psn);
      /* the physical sector number of the beginning of the bad block
         relocation pool */

void lvm_chgbb (
    struct bad_blk * head_ptr,
      /* a pointer to the head of the bad block linked list */
    daddr_t bad_blk,
      /* the bad block whose data is to be changed */
    daddr_t reloc_blk,
      /* the new value for the relocation block, if it is to be changed */
    int chgtype);
      /* type of change requested (change relocation block or status field)
         for this bad block */

/*
 * chkquorum.c
 */

int lvm_chkquorum (
    struct varyonvg * varyonvg,
      /* pointer to the structure which contains input parameter data for
         the lvm_varyonvg routine */
    int vg_fd,
      /* file descriptor for the volume group reserved area logical
         volume */
    struct inpvs_info * inpvs_info,
      /* structure which contains information about the input list of PVs
         for the volume group */
    struct defpvs_info * defpvs_info,
      /* structure which contains information about volume group descriptor
         areas and status areas for the defined PVs in the volume group */
    caddr_t vgda_ptr,
      /* pointer to the volume group descriptor area */
    struct vgsa_area ** vgsa_ptr,
      /* pointer to the volume group status area */
    daddr_t vgsa_lsn [LVM_MAXPVS] [LVM_PVMAXVGDAS],
      /* array in which to store the logical sector number addresses of the
         VGSAs for each PV */
    char mwcc [DBSIZE]);
      /* buffer in which the latest mirror write consistency cache will be
         returned */

int lvm_vgsamwcc (
    int vg_fd,
      /* file descriptor for the VG reserved area logical volume which
         contains the volume group descriptor area and status area */
    struct inpvs_info * inpvs_info,
      /* structure which contains information about the input list of PVs for
         the volume group */
    struct defpvs_info * defpvs_info,
      /* pointer to structure which contains information about PVs defined
         into the kernel */
    caddr_t vgda_ptr,
      /* pointer to the volume group descriptor area */
    long quorum,
      /* number of VGDAs/VGSAs needed to varyon in order to ensure that the
         volume group data is consistent with that from previous varyon */
    struct vgsa_area ** vgsa_ptr,
      /* variable to contain the pointer to the buffer which will contain
         the volume group status area */
    daddr_t vgsa_lsn [LVM_MAXPVS] [LVM_PVMAXVGDAS],
      /* array in which to store the logical sector number addresses of the
         VGSAs for each PV */
    char mwcc [DBSIZE]);
      /* buffer in which the latest mirror write consistency cache will be
         returned */
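
The quorum parameter is described above as the number of VGDA/VGSA copies needed to vary on while keeping the data consistent with the previous varyon. A minimal sketch of such a test follows, assuming that the quorum is a strict majority of all copies; that rule, the function names and the array of per-copy flags are illustrative assumptions and are not taken from the listing.

```c
#include <assert.h>

/* Assumed rule: strictly more than half of all copies must be usable. */
static long quorum_needed(long total_copies)
{
    assert(total_copies > 0);
    return total_copies / 2 + 1;
}

/* Count the copies flagged as readable and compare against the quorum.
   readable[i] is nonzero when copy i was read without error. */
static int have_quorum(const int readable[], int ncopies, long quorum)
{
    long good = 0;
    int i;

    for (i = 0; i < ncopies; i++)
        if (readable[i])
            good++;
    return good >= quorum;
}
```

Under the majority assumption, any two varyon attempts that both reach quorum must share at least one copy, which is what would let the newest data be found again.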



/*
 * comutl.c
 */

int lvm_chkvaryon (
    struct unique_id * vg_id);
      /* the id of the volume group */

void lvm_mapoff (
    struct fheader * mapfilehdr,
      /* a pointer to the mapped file header which contains the offsets of
         the different data areas within the mapped file */
    caddr_t vgda_ptr);
      /* a pointer to the beginning of the volume group descriptor area */

int lvm_openmap (
    struct unique_id * vg_id,
      /* a pointer to the volume group id */
    int mapf_mode,
      /* the access mode with which to open the mapped file */
    int * vgmap_fd,
      /* pointer to the variable in which to return the file descriptor of the
         mapped file */
    caddr_t * vgmap_ptr);
      /* pointer to the variable in which to return the pointer to the
         beginning of the mapped file */

int lvm_relocmwcc (
    int pv_fd,
      /* file descriptor of physical volume where block containing the mirror
         write consistency cache needs to be relocated */
    char mwcc [DBSIZE]);
      /* buffer which contains data to be written to mirror write consistency
         cache */

int lvm_rdiplrec (
    int pv_fd,
      /* the file descriptor for the physical volume device */
    IPL_REC_PTR ipl_rec);
      /* a pointer to the buffer into which the IPL record will be read */

int lvm_tscomp (
    struct timestruc_t * ts1,
      /* first timestamp value */
    struct timestruc_t * ts2);
      /* second timestamp value */

int lvm_updtime (
    struct timestruc_t * beg_time,
      /* a pointer to the beginning timestamp to be updated */
    struct timestruc_t * end_time);
      /* a pointer to the ending timestamp to be updated */



/*
 * crtinsutl.c
 */

int lvm_initbbdir (
    int pv_fd,
      /* the file descriptor for the physical volume device */
    daddr_t reloc_psn);
      /* the physical sector number of the beginning of the bad block
         relocation pool */

void lvm_initlvmrec (
    struct lvm_rec * lvm_rec,
      /* pointer to the LVM information record */
    short int vgda_size,
      /* the length of the volume group descriptor area in blocks */
    short int ppsize,
      /* physical partition size represented as a power of 2 */
    long data_capacity);
      /* the data capacity of the disk in number of blocks */

int lvm_instsetup (
    struct unique_id * vg_id,
      /* pointer to id of the volume group into which the PV is to be
         installed */
    char * pv_name,
      /* a pointer to the name of the physical volume to be added to the
         volume group */
    short int override,
      /* flag for which a true value indicates to override a VG member error,
         if it occurs, and install the physical volume into the indicated
         volume group */
    struct unique_id * cur_vg_id,
      /* structure in which to return the volume group id, if this PV's
         LVM record indicates it is already a member of a volume group */
    int * pv_fd,
      /* a pointer to where the file descriptor for the physical volume
         device will be stored */
    IPL_REC_PTR ipl_rec,
      /* a pointer to the block into which the IPL record will be read */
    struct lvm_rec * lvm_rec,
      /* a pointer to the block into which the LVM information record will
         be read */
    long * data_capacity);
      /* the data capacity of the disk in number of sectors */

void lvm_pventry (
    struct unique_id * pv_id,
      /* pointer to a structure which contains id for the physical volume for
         which the entry is to be created */
    struct vg_header * vghdr_ptr,
      /* a pointer to the volume group header of the descriptor area */
    struct pv_header ** pv_ptr,
      /* a pointer to the beginning of the list of physical volume entries
         in the descriptor area */
    long num_parts,
      /* the number of partitions available on this physical volume */
    daddr_t beg_psn,
      /* the physical sector number of the first physical partition on this
         physical volume */
    short int num_vgdas);
      /* the number of volume group descriptor areas which are to be placed
         on this physical volume */

int lvm_vgdas3to3 (
    int lv_fd,
      /* the file descriptor of the LVM reserved area logical volume */
    caddr_t vgmap_ptr,
      /* pointer to the beginning of the mapped file */
    short int new_pv,
      /* the PV number of the new physical volume which is being added */
    short int save_pv_2,
      /* the PV number of the physical volume which previously had two copies
         of the VGDA */
    short int save_pv_1);
      /* the PV number of the physical volume which previously had one copy
         of the VGDA */

int lvm_vgmem (
    struct unique_id * pv_id,
      /* pointer to id of the physical volume for which we are to determine
         membership in the specified VG */
    caddr_t vgda_ptr);
      /* pointer to the beginning of the volume group descriptor area */

int lvm_zeromwc (
    int pv_fd,
      /* the file descriptor of the physical volume */
    short int newvg);
      /* flag to indicate if this is newly created volume group */

int lvm_zerosa (
    int lv_fd,
      /* the file descriptor for the LVM reserved area logical volume */
    daddr_t sa_lsn [LVM_PVMAXVGDAS]);
      /* the logical sector numbers within the LVM reserved area logical
         volume of where to initialize the copies of the volume group status
         area */



/*
 * configutl.c
 */

int lvm_addmpv (
    struct unique_id * vg_id,
      /* the volume group id of the volume group which is to be added into
         the kernel */
    long vg_major,
      /* the major number where the volume group is to be added */
    short int pv_num);
      /* number of the PV to be deleted from the volume group */

int lvm_addpv (
    long partlen_blks,
      /* the length of a partition in number of 512 byte blocks */
    short int num_desclps,
      /* the number of partitions needed on each physical volume to contain
         the LVM reserved area */
    dev_t device,
      /* the major / minor number of the device */
    int pv_fd,
      /* the file descriptor of the physical volume device */
    short int pv_num,
      /* the index number for this physical volume */
    long vg_major,
      /* the major number of the volume group */
    struct unique_id * vg_id,
      /* the volume group id of the volume group to which the physical
         volume is to be added */
    daddr_t reloc_psn,
      /* the physical sector number of the beginning of the bad block
         relocation pool */
    long reloc_len,
      /* the length of the bad block relocation pool */
    daddr_t psn_part1,
      /* the physical sector number of the first partition on the physical
         volume */
    daddr_t vgsa_lsn [LVM_PVMAXVGDAS],
    short int quorum_cnt);
      /* the number of VGDAs/VGSAs needed for a quorum */

int lvm_chgqrm (
    struct unique_id * vg_id,
      /* the volume group id of the volume group */
    long vg_major,
      /* the major number of the volume group */
    short int quorum_cnt);
      /* number of VGDA/VGSA copies needed for a quorum */

int lvm_chgvgsa (
    struct unique_id * vg_id,
      /* the volume group id */
    long vg_major,
      /* the major number of the volume group */
    daddr_t vgsa_lsn [LVM_PVMAXVGDAS],
      /* array of logical sector number addresses of the VGSA copies on this
         PV */
    short int pv_num,
      /* number of the PV which is to have changes to the number of VGSAs */
    short int quorum_cnt,
      /* number of VGDAs/VGSAs needed for a quorum */
    int command);
      /* command value which indicates the config routine to be called is
         that for adding/deleting VGSAs */

int lvm_chkvgstat (
    struct varyonvg * varyonvg,
      /* pointer to the structure which contains input information for
         varyonvg */
    int * vgstatus);
      /* pointer to variable to contain the varied on status of the volume
         group */

int lvm_config (
    mid_t kmid,
      /* the module id for the object module which contains the logical
         volume device driver */
    long vg_major,
      /* the major number of the volume group */
    int request,
      /* the request for the configuration routine to be called within the
         kernel hd_config routine */
    struct ddi_info * cfgdata);
      /* structure to contain the input parameters for the configuration
         device driver */

int lvm_defvg (
    long partlen_blks,
      /* the length of a partition in number of 512 byte blocks */
    short int num_desclps,
      /* the number of partitions needed on each physical volume to contain
         the LVM reserved area */
    mid_t kmid,
      /* the module id which identifies where the LVDD code is loaded */
    long vg_major,
      /* the major number where the volume group is to be added */
    struct unique_id * vg_id,
      /* the volume group id of the volume group which is to be added into
         the kernel */
    short int ppsize,
      /* the physical partition size, represented as a power of 2 of the
         size in bytes, for partitions in this volume group */
    long noopen_lvs);
      /* flag to indicate if logical volumes in the volume group are not
         allowed to be opened */

int lvm_delpv (
    struct unique_id * vg_id,
      /* the volume group id of the volume group which is to be added into
         the kernel */
    long vg_major,
      /* the major number where the volume group is to be added */
    short int pv_num,
      /* number of the PV to be deleted from the volume group */
    short int num_desclps,
      /* number of logical partitions in the descriptor / status area
         logical volume for this PV */
    int flag,
      /* flag to indicate whether the PV is being deleted from the volume
         group or just temporarily removed */
    short int quorum_cnt);
      /* quorum count of logical volume */

void lvm_delvg (
    struct unique_id * vg_id,
      /* the volume group id of the volume group which is to be added into
         the kernel */
    long vg_major);
      /* the major number where the volume group is to be added */



/*
 * lvmrecutl.c
 */

void lvm_cmplvmrec (
    struct unique_id * vgid,        /* pointer to volume group id */
    char * match,                   /* indicates a matching vgid */
    char pvname [LVM_NAMESIZ]);     /* name of pv to read lvm rec from */

int lvm_rdlvmrec (
    int pv_fd,
      /* the file descriptor for the physical volume device */
    struct lvm_rec * lvm_rec);
      /* a pointer to the buffer into which the LVM information record will
         be read */

int lvm_wrlvmrec (
    int pv_fd,
      /* the file descriptor for the physical volume device */
    struct lvm_rec * lvm_rec);
      /* a pointer to the buffer which contains the LVM information record
         to be written */

void lvm_zerolvm (
    int pv_fd);
      /* the file descriptor for the physical volume device */



/*
 * queryutl.c
 */

extern int lvm_chklvclos (
    struct lv_id * lv_id,
      /* logical volume id */
    long major_num);
      /* major number of volume group */

extern int lvm_getpvda (
    char * pv_name,
      /* a pointer to the name of the physical volume to be added to the
         volume group */
    char ** map_ptr,
      /* a pointer to where the pointer to the memory area containing the
         mapped file information will be stored */
    int rebuild);
      /* indicates we are rebuilding the vg file */

extern int lvm_gettsinfo (
    int pvfd,       /* file descriptor for physical volume */
    daddr_t psn [LVM_PVMAXVGDAS],
      /* array of physical sector numbers for VGDAs */
    long vgdalen,
      /* length of volume group descriptor area */
    int * copy,     /* copy of VGDA with newest timestamp */
    int rebuild);
      /* indicates we are rebuilding the vg file */

/*
 * rdex_com.c
 */

extern int rdex_proc (
    struct lv_id * lv_id,           /* logical volume id */
    struct ext_redlv * ext_red,     /* maps of pps to be extended or reduced */
    char * vgfptr,                  /* pointer to volume group mapped file */
    int vgfd,                       /* volume group file descriptor */
    short minor_num,                /* minor number of logical volume */
    int indicator);                 /* indicator for extend or reduce operation */

/*
 * revaryon.c
 */

int lvm_revaryon (
    struct varyonvg * varyonvg,
      /* pointer to a structure which contains the input information for
         the lvm_varyonvg subroutine */
    int vgmap_fd,
      /* the file descriptor for the mapped file */
    struct inpvs_info * inpvs_info,
      /* a pointer to the structure which contains information about PVs
         from the input list */
    struct defpvs_info * defpvs_info);
      /* structure which contains information about volume group descriptor
         areas and status areas for the defined PVs in the volume group */

int lvm_vonmisspv (
    struct varyonvg * varyonvg,
      /* pointer to a structure which contains the input information for
         the lvm_varyonvg subroutine */
    struct inpvs_info * inpvs_info,
      /* a pointer to the structure which contains information about PVs
         from the input list */
    struct defpvs_info * defpvs_info,
      /* structure which contains information about volume group descriptor
         areas and status areas for the defined PVs in the volume group */
    struct fheader * maphdr_ptr,
      /* pointer to the mapped file header */
    caddr_t vgda_ptr,
      /* pointer to the beginning of the volume group descriptor area */
    struct pv_header * pv_ptr,
      /* a pointer to the header of a physical volume entry in the volume
         group descriptor area */
    int vg_fd,
      /* the file descriptor for the volume group reserved area logical
         volume */
    short int in_index,
      /* index into the input list of a physical volume */
    int * chkvvgout);
      /* flag to indicate if the varyonvg output structure should be
         checked */



/*
 * setupvg.c
 */

int lvm_setupvg (
    struct varyonvg * varyonvg,
      /* pointer to the structure which contains input information for
         varyonvg */
    struct inpvs_info * inpvs_info,
      /* a pointer to the structure which contains information about PVs
         from the input list */
    struct defpvs_info * defpvs_info,
      /* pointer to the structure which contains information about the physical
         volumes defined into the kernel */
    struct fheader * maphdr_ptr,
      /* a pointer to the file header portion of the mapped file */
    int vg_fd,
      /* file descriptor of the volume group reserved area logical volume */
    caddr_t vgda_ptr,
      /* a pointer to the in-memory copy of the volume group descriptor
         area */
    struct vgsa_area * vgsa_ptr,
      /* a pointer to the volume group status area */
    daddr_t vgsa_lsn [LVM_MAXPVS] [LVM_PVMAXVGDAS],
      /* array of logical sector number addresses of all VGSA copies */
    struct mwc_rec * mwcc);
      /* buffer which contains latest mirror write consistency cache */

int lvm_bldklvlp (
    caddr_t vgda_ptr,
      /* a pointer to the volume group descriptor area */
    struct vgsa_area * vgsa_ptr,
      /* a pointer to the volume group status area */
    struct lvol * lvol_ptrs [LVM_MAXLVS]);
      /* array of pointers to the LVDD logical volume structures */

int lvm_mwcinfo (
    struct varyonvg * varyonvg,
      /* pointer to the structure which contains input information for
         varyonvg */
    struct inpvs_info * inpvs_info,
      /* a pointer to the structure which contains information about PVs
         from the input list */
    struct defpvs_info * defpvs_info,
      /* pointer to structure which contains information about the physical
         volumes defined into the kernel */
    struct fheader * maphdr_ptr,
      /* a pointer to the file header portion of the mapped file */
    int vg_fd,
      /* file descriptor of the volume group reserved area logical volume */
    caddr_t vgda_ptr,
      /* a pointer to the volume group descriptor area */
    struct vgsa_area * vgsa_ptr,
      /* a pointer to the volume group status area */
    daddr_t vgsa_lsn [LVM_MAXPVS] [LVM_PVMAXVGDAS],
      /* array of logical sector number addresses of all VGSA copies */
    struct lvol * lvol_ptrs [LVM_MAXLVS],
      /* array of pointers to LVDD logical structures */
    struct mwc_rec * mwcc,
      /* buffer which contains latest mirror write consistency cache */
    struct mwc_rec * kmwcc,
      /* buffer to contain list of logical track groups from the MWCC which
         need to be resynced in the kernel */
    short int * num_entries);
      /* number of logical track group entries in the kernel MWCC buffer */



/*
 * synclp.c
 */

extern int synclp (
    int lvfd,                   /* logical volume file descriptor */
    struct lv_entries * lv,     /* pointer to logical volume entry */
    struct unique_id * vg_id,   /* volume group id */
    char * vgptr,               /* pointer to the volume group mapped file */
    int vgfd,                   /* volume group mapped file descriptor */
    short minor_num,            /* minor number of the logical volume */
    long lpnum,                 /* logical partition number to sync */
    int force);                 /* resync any non-stale lp if TRUE */



/*
 * utilities.c
 */

extern int get_lvinfo (
    struct lv_id * lv_id,       /* logical volume id */
    struct unique_id * vg_id,   /* volume group id */
    short * minor_num,          /* logical volume minor number */
    int * vgfd,                 /* volume group file descriptor */
    char ** vgptr,              /* pointer to volume group mapped file */
    int mode);                  /* how to open the vg mapped file */

extern int get_ptrs (
    char * vgmptr,              /* pointer to the beginning of the volume */
                                /* group mapped file */
    struct fheader ** header,   /* points to the file header */
    struct vg_header ** vgptr,  /* points to the volume group header */
    struct lv_entries ** lvptr, /* points to the logical volume entries */
    struct pv_header ** pvptr,  /* points to the physical volume header */
    struct pp_entries ** pp_ptr,    /* points to the physical partition entries */
    struct namelist ** nameptr);    /* points to the name descriptor area */

extern int lvm_errors (
    char failing_rtn [LVM_NAMESIZ], /* name of routine with error */
    char calling_rtn [LVM_NAMESIZ], /* name of calling routine */
    int rc);                        /* error returned from failing rtn */

extern int get_pvandpp (
    struct pv_header ** pv,     /* pointer to the physical volume header */
    struct pp_entries ** pp,    /* pointer to the physical partition entry */
    short * pvnum,              /* pv number of physical volume id sent in */
    char * vgfptr,              /* pointer to the volume group mapped file */
    struct unique_id * id);     /* id of pv you need a pointer to */

extern int bldlvinfo (
    struct logview ** lp,       /* pointer to logical view of a logical vol */
    char * vgfptr,              /* pointer to volume group mapped file */
    struct lv_entries * lv,     /* pointer to a logical volume */
    long * cnt,                 /* number of pps per copy of logical volume */
    short minor_num,            /* minor number of logical volume */
    int flag);                  /* GETSTALE if info on stale pps is desired */
                                /* NOSTALE if not */

extern int status_chk (
    char * vgptr,               /* pointer to volume group mapped file */
    char * name,                /* name of device to be checked */
    int flag,                   /* indicator to check the major number */
    char * rawname);            /* pointer to new raw device name */

extern int lvm_special_3to2 (
    char * pvname,
      /* name of physical volume being removed */
    struct unique_id * vgid,
      /* pointer to volume group id */
    int lvfd,
      /* reserved area logical volume file desc */
    char * vgdaptr,
      /* pointer to vgda area of the vg file */
    short pvnum0,               /* number of pv to delete/remove */
    short pvnum1,               /* number of pv to keep one copy */
    short pvnum2,               /* number of pv to keep two copies */
    char * match,
      /* indicates the vgid in the lvm record */
      /* matches the one passed in */
    char delete,                /* indicates we are called by deletepv */
    struct fheader * fhead);    /* pointer to vg mapped file header */

extern int getstates (
    struct vgsa_area * vgsa,    /* pointer to buffer for volume group status */
                                /* area */
    char * vgfptr);             /* pointer to volume group mapped file */

extern int timestamp (
    struct vg_header * vg,      /* pointer to volume group header */
    char * vgptr,               /* pointer to volume group file */
    struct fheader * fhead);    /* pointer to vg file header */

extern int timestamp (
    struct vg_header * vg,      /* pointer to volume group header */
    char * vgptr,               /* pointer to volume group file */
    struct fheader * fhead);    /* pointer to file header of vg file */

extern int buildname (
    dev_t dev,                  /* device info for physical volume */
    char name [LVM_EXTNAME],    /* array to store name we create for pv */
    int mode,                   /* mode to set the device entry to */
    int type);                  /* type of name to build */

extern int rebuild_file (
    struct unique_id * vgid,    /* pointer to volume group id */
    int * vgfd);                /* vg file descriptor */

extern void calc_lsn (
    struct fheader * fhead,     /* pointer to volume group file header */
    struct rebuild * rebuild);  /* pointer to info from the rebuilding */
                                /* of the volume group file */

/*
 * varyonvg.c
 */

int lvm_forceqrm(
    caddr_t vgda_ptr,
      /* pointer to the volume group descriptor area */
    struct defpvs_info * defpvs_info);
      /* structure which contains information about volume group descriptor
         areas and status areas for the defined PVs in the volume group */

void lvm_mapfile(
    struct varyonvg * varyonvg,
      /* pointer to a structure which contains the input information for
         the lvm_varyonvg subroutine */
    struct inpvs_info * inpvs_info,
      /* a pointer to the structure which contains information about PVs
         from the input list */
    struct defpvs_info * defpvs_info,
      /* pointer to structure which contains information about PVs which are
         defined into the kernel */
    struct fheader * mapfilehdr,
      /* a pointer to the mapped file header which contains the offsets of
         the different data areas within the mapped file */
    caddr_t vgda_ptr);
      /* pointer to the volume group descriptor area */

void lvm_pvstatus (
    struct varyonvg * varyonvg,
      /* pointer to the structure which contains input parameter data for
         the lvm_varyonvg routine */
    struct defpvs_info * defpvs_info,
      /* pointer to structure which contains information about PVs which are
         defined into the kernel */
    caddr_t vgda_ptr,
      /* pointer to the volume group descriptor area */
    int * missname,
      /* flag to indicate if there are any PV names missing from the input
         list */
    int * misspv);
      /* flag to indicate if there are any PVs missing from the varied-on
         volume group (i.e., PVs that could not be defined into the kernel) */

int lvm_update (
    struct varyonvg * varyonvg,
      /* pointer to the structure which contains input information for
         varyonvg */
    struct inpvs_info * inpvs_info,
      /* a pointer to the structure which contains information about PVs
         from the input list */
    struct defpvs_info * defpvs_info,
      /* structure which contains information about volume group descriptor
         areas and status areas for the defined PVs in the volume group */
    struct fheader * maphdr_ptr,
      /* a pointer to the file header portion of the mapped file */
    int vg_fd,
      /* the file descriptor for the volume group reserved area logical
         volume */
    caddr_t vgda_ptr,
      /* pointer to the volume group descriptor area */
    struct vgsa_area * vgsa_ptr,
      /* pointer to the volume group status area */
    daddr_t vgsa_lsn [LVM_MAXPVS] [LVM_PVMAXVGDAS],
      /* array which contains the logical sector number addresses of all
         the VGSAs */
    char mwcc [DBSIZE],
      /* buffer containing the latest mirror write consistency cache */
    int forceqrm,
      /* flag to indicate if the quorum has been forced */
    int misspv);
      /* flag to indicate if there were any missing PVs */



/*
 * verify.c
 */

int lvm_verify (
    struct varyonvg * varyonvg,
      /* pointer to the structure which contains input parameter data for
         the lvm_varyonvg routine */
    int * vg_fd,
      /* pointer to the variable to contain the file descriptor for the
         volume group reserved area logical volume */
    struct inpvs_info * inpvs_info,
      /* structure which contains information about the input list of PVs
         for the volume group */
    struct defpvs_info * defpvs_info,
      /* structure which contains information about volume group descriptor
         areas and status areas for the defined PVs in the volume group */
    caddr_t * vgda_ptr,
      /* pointer to variable where the pointer to the volume group descriptor
         area is to be returned */
    struct vgsa_area ** vgsa_ptr,
      /* variable to contain the pointer to the buffer which will contain the
         volume group status area */
    daddr_t vgsa_lsn [LVM_MAXPVS] [LVM_PVMAXVGDAS],
      /* array in which to store the logical sector number addresses of the
         VGSAs for each PV */
    char mwcc [DBSIZE]);
      /* buffer in which the latest mirror write consistency cache will be
         returned */

int lvm_defpvs (
    struct varyonvg * varyonvg,
      /* pointer to the structure which contains input parameter data for
         the lvm_varyonvg routine */
    int vg_fd,
      /* file descriptor of the volume group reserved area logical volume */
    struct inpvs_info * inpvs_info,
      /* structure which contains information about the input list of PVs for
         the volume group */
    struct defpvs_info * defpvs_info);
      /* structure which contains information about the volume group
         descriptor and status areas for PVs defined into the kernel */

void lvm_getdainfo (
    int vg_fd,
      /* file descriptor for the VG reserved area logical volume which
         contains the volume group descriptor area and status area */
    short int pv_index,
      /* index variable for looping on physical volumes in input list */
    struct inpvs_info * inpvs_info,
      /* structure which contains information about the input list of PVs for
         the volume group */
    struct defpvs_info * defpvs_info);
      /* pointer to structure which contains information about PVs defined
         into the kernel */

int lvm_readpvs (
    struct varyonvg * varyonvg,
      /* pointer to the structure which contains input parameter data for
         the lvm_varyonvg routine */
    struct inpvs_info * inpvs_info);
      /* structure which contains information about the input list of PVs
         for the volume group */

int lvm_readvgda (
    int vg_fd,
      /* the file descriptor for the volume group reserved area logical
         volume */
    short int override,
      /* flag which indicates if no quorum error is to be overridden */
    struct inpvs_info * inpvs_info,
      /* structure which contains information about the input list of PVs
         for the volume group */
    struct defpvs_info * defpvs_info,
      /* structure which contains information about volume group descriptor
         areas and status areas for the defined PVs in the volume group */
    caddr_t * vgda_ptr);
      /* pointer to buffer in which to read the volume group descriptor
         area */

/*
 * vonutl.c
 */

void lvm_clsinpvs (
    struct varyonvg * varyonvg,
      /* pointer to the structure which contains input parameter data for
         the lvm_varyonvg routine */
    struct inpvs_info * inpvs_info);
      /* structure which contains information about the input list of PVs
         for the volume group */

int lvm_deladdm (
    struct varyonvg * varyonvg,
      /* pointer to the structure which contains input information for
         varyonvg */
    struct inpvs_info * inpvs_info,
      /* a pointer to the structure which contains information about PVs
         from the input list */
    struct defpvs_info * defpvs_info,
      /* pointer to structure which contains information about the physical
         volumes defined into the kernel */
    short int pv_num);
      /* the PV number of PV being changed to a missing PV */

int lvm_vonresync (
    struct unique_id * vg_id);
      /* pointer to the volume group id */



/*
 * wrtutl.c
 */

int lvm_diskio (
    caddr_t vg_mapptr,
      /* a pointer to the mapped file for this volume group */
    int vg_mapfd);
      /* the file descriptor of the mapped file for this volume group */

int lvm_updvgda (
    int lv_fd,
      /* the file descriptor of the descriptor area logical volume, if it
         is already open */
    struct fheader * maphdr_ptr,
      /* a pointer to the file header portion of the mapped file */
    caddr_t vgda_ptr);
      /* a pointer to the memory location which holds the volume group
         descriptor area */

int lvm_wrtdasa (
    int vg_fd,
      /* the file descriptor of the LVM reserved area logical volume */
    caddr_t area_ptr,
      /* a pointer to the memory location which holds the volume group
         descriptor or status area */
    struct timestruc_t * e_tmstamp,
      /* a pointer to the end timestamp for the area */
    long area_len,
      /* the length in sectors of the area */
    daddr_t lsn,
      /* the logical sector number within the LVM reserved area logical
         volume of where to write a copy of the area */
    short int write_order);
      /* flag which indicates whether the area is to be written in the order
         of beginning/middle/end or middle/beginning/end */

int lvm_wrtmapf (
    int vgmap_fd,
      /* the file descriptor of the mapped file */
    caddr_t vgmap_ptr);
      /* the pointer to the beginning of the mapped file */

int lvm_wrtnext (
    int lv_fd,
      /* the file descriptor of the LVM reserved area logical volume */
    caddr_t vgda_ptr,
      /* a pointer to the memory location which holds the volume group
         descriptor area */
    struct timestruc_t * etimestamp,
      /* pointer to the ending timestamp in the vg trailer */
    short int pvnum,
      /* the PV number of the PV to which the VGDA is to be written */
    struct fheader * maphdr_ptr,
      /* pointer to the mapped file header */
    short int pvnum_vgdas);
      /* number of volume group descriptor areas to be written to the PV */



#else

/*
 * bbdirutl.c
 */

int lvm_bbdsane ( );
int lvm_getbbdir ( );
int lvm_rdbbdir ( );
int lvm_wrbbdir ( );

/*
 * bblstutl.c
 */

void lvm_addbb ( );
int lvm_ldbblst ( );
void lvm_chgbb ( );

/*
 * chkquorum.c
 */

int lvm_chkquorum ( );
int lvm_vgsamwcc ( );

/*
 * computl.c
 */

int lvm_chkvaryon ( );
int lvm_mapoff ( );
int lvm_openmap ( );
int lvm_relocmwcc ( );
int lvm_rdiplrec ( );
int lvm_tscomp ( );
int lvm_updtime ( );

/*
 * crtinsutl.c
 */

int lvm_initbbdir ( );
void lvm_initlvmrec ( );
int lvm_instsetup ( );
void lvm_pventry ( );
int lvm_vgdas3to3 ( );
int lvm_vgmem ( );
int lvm_zeromwc ( );
int lvm_zerosa ( );

/*
 * configutl.c
 */

int lvm_addmpv ( );
int lvm_addpv ( );
int lvm_chgvgsa ( );
int lvm_chkvgstat ( );
int lvm_config ( );
int lvm_defvg ( );
int lvm_delpv ( );
void lvm_delvg ( );

/*
 * lvmrecutl.c
 */

void lvm_cmplvmrec ( );
int lvm_rdlvmrec ( );
int lvm_wrlvmrec ( );
void lvm_zerolvm ( );

/*
 * queryutl.c
 */

extern int lvm_chklvclos ( );
extern int lvm_getpvda ( );
extern int lvm_gettsinfo ( );

/*
 * rdex_com.c
 */

extern int rdex_proc ( );
/* struct lv_id *lv_id         logical volume id
   struct ext_redlv *ext_red   maps of pps to be extended or reduced
   int indicator               indicator for extend or reduce operation */

/*
 * revaryon.c
 */

int lvm_revaryon ( );
void lvm_vonmisspv ( );

/*
 * setupvg.c
 */

int lvm_setupvg ( );
int lvm_bldklvlp ( );
int lvm_mwcinfo ( );

/*
 * synclp.c
 */

extern int synclp ( );
/* int lvfd                    logical volume file descriptor
   struct lv_entries *lv       pointer to logical volume entry
   char *vgptr                 pointer to the volume group mapped file
   int vgfd                    volume group mapped file descriptor
   short minor_num             minor number of the logical volume
   int lpnum                   logical partition number to sync
   struct unique_id *vg_id     volume group id
   int force                   resync any non-stale lp if TRUE */

/*
 * utilities.c
 */

extern int get_lvinfo ( );
/* struct unique_id *vg_id     volume group id
   struct lv_id *lv_id         logical volume id
   int *vgfd                   volume group file descriptor
   short *minor_num            logical volume minor number
   char **vgptr                pointer to volume group mapped file
   int mode                    how to open the vg mapped file */

extern int get_ptrs ( );
/* struct fheader **header     points to the file header
   struct vg_header **vgptr    points to the volume group header
   struct lv_entries **lvptr   points to the logical volume entries
   struct pv_header **pvptr    points to the physical volume header
   struct pp_entries **pp_ptr  points to the physical partition entries
   struct namelist **nameptr   points to the name descriptor area */

extern int lvm_errors ( );
/* char failing_rtn[LVM_NAMESIZ]   name of routine with error
   char calling_rtn[LVM_NAMESIZ]   name of calling routine
   int rc                          error returned from failing rtn */

extern int get_pvandpp ( );
/* struct pv_header **pv       pointer to the physical volume header
   struct pp_entries **pp      pointer to the physical partition entry
   short *pvnum                pv number of physical volume id sent
   char *vgfptr                pointer to the volume group mapped file
   struct unique_id *id        id of pv you need a pointer to */

extern int bldlvinfo ( );
/* struct logview **lp         pointer to logical view of a logical vol
   char *vgfptr                pointer to volume group mapped file
   struct lv_entries *lv       pointer to a logical volume
   long *cnt                   number of pps per copy of logical volume
   short minor_num             minor number of logical volume */

extern int status_chk ( );
/* char *vgptr                 pointer to volume group mapped file
   char *name                  name of device to be checked
   int flag                    indicator to check the major number
   char *rawname               pointer to new raw device name */

extern int lvm_special_3to2 ( );
/* char *pvname,               name of physical volume being removed
   struct unique_id *vgid,     pointer to the volume group id
   int lvfd,                   reserved area logical volume file desc
   char *vgdaptr,              pointer to vgda area of the vg file
   short pvnum0,               number of pv to delete/remove
   short pvnum1,               number of pv to keep one copy
   short pvnum2,               number of pv to keep two copies
   char match                  indicates the vgid in the lvm record matches
                               the vgid passed in
   char delete                 indicates we are called by deletepv
   struct fheader *fhead       pointer to vg mapped file header
*/

extern int getstates ( );
/* struct vgsa_area *vgsa      pointer to buffer for volume group status
                               area
   char *vgfptr                pointer to volume group mapped file
*/

extern int timestamp ( );
/* struct vg_header *vg,       pointer to volume group header
   char *vgptr,                pointer to volume group file
   struct fheader *fhead);     pointer to vg file header
*/

extern int timestamp ( );
/*
   struct vg_header *vg,       pointer to volume group header
   char *vgptr,                pointer to volume group file
   struct fheader *fhead)      pointer to file header of vg file
*/

extern int buildname ( );
/* dev_t dev,                  device info for physical volume
   char name[LVM_EXTNAME],     array to store name we create for pv
   int mode,                   mode to set the device entry to
   int type);                  type of name to build
*/

extern int rebuild_file ( );
/*
   struct unique_id *vgid,     pointer to volume group id
   int *vgfd);                 vg file descriptor
*/

extern void calc_lsn ( );
/*
   struct fheader *fhead,      pointer to volume group file header
   struct rebuild *rebuild);   pointer to info from the rebuilding
                               of the volume group file
*/



/*
 * varyonvg.c
 */

int lvm_deladdm ( );
int lvm_forceqrm ( );
void lvm_mapfile ( );
int lvm_pvstatus ( );
int lvm_update ( );

/*
 * verify.c
 */

int lvm_verify ( );
void lvm_clsinpvs ( );
int lvm_defpvs ( );
void lvm_getdainfo ( );
int lvm_readpvs ( );
int lvm_readvgda ( );

/*
 * wrtutl.c
 */

int lvm_diskio ( );
void lvm_updvgda ( );
int lvm_wrtdasa ( );
int lvm_wrtmapf ( );
int lvm_wrtnext ( );

#endif /* _NO_PROTO */
#endif /* _H_LIBLVM */



/* DASD.H */

#ifndef _H_DASD
#define _H_DASD

/*
 * COMPONENT_NAME: (SYSXLVM) Logical Volume Manager - dasd.h
 *
 * © COPYRIGHT International Business Machines Corp. 1988, 1990
 * All Rights Reserved
 */

/*
 * Logical Volume Manager Device Driver data structures.
 */

#include <sys/types.h>
#include <sys/sleep.h>
#include <sys/lockl.h>
#include <sys/sysmacros.h>
#include <sys/buf.h>
#include <sys/lvdd.h>

/* FIFO queue structure for scheduling logical requests. */
struct hd_queue {               /* queue header structure */
    struct buf *head;           /* oldest request in the queue */
    struct buf *tail;           /* newest request in the queue */
};

struct hd_capvq {               /* queue header structure */
    struct pv_wait *head;       /* oldest request in the queue */
    struct pv_wait *tail;       /* newest request in the queue */
};

/*
 * Structure used by hd_redquiet() to mark target PPs for removal.
 * Both are zero relative.
 */
struct hd_lvred {
    long lp;                    /* LP the pp belongs to */
    char mirror;                /* mirror number of PP */
};

/*
 * Physical request buf structure.
 *
 * A 'pbuf' is a 'buf' structure with some additional fields used
 * to track the status of the physical requests that correspond to
 * each logical request. A pool of pinned pbufs is allocated and
 * managed by the device driver. The size of this pool depends on
 * the number of open logical volumes.
 */
struct pbuf {
    /* this must come first, 'buf' pointers can be cast to 'pbuf' */
    struct buf pb;              /* imbedded buf for physical driver */

    /* physical buf structure appendage: */
    struct buf *pb_lbuf;        /* corresponding logical buf struct */

    /* scheduler I/O done policy function */
#ifndef _NO_PROTO
    void (*pb_sched) (struct pbuf *);
#else
    void (*pb_sched) ();
#endif

    struct pvol *pb_pvol;       /* physical volume structure */
    struct bad_blk *pb_bad;     /* defects directory entry */
    daddr_t pb_start;           /* starting physical address */

    char pb_mirror;             /* current mirror */
    char pb_miravoid;           /* mirror avoidance mask */
    char pb_mirbad;             /* mask of broken mirrors */
    char pb_mirdone;            /* mask of mirrors done */

    char pb_swretry;            /* number of sw relocation retries */
    char pb_type;               /* Type of pbuf */
    char pb_bbop;               /* BB directory operation */
    char pb_bbstat;             /* status of BB directory operation */

    uchar pb_whl_stop;          /* wheel_idx value when this pbuf is */
                                /* to get off of the wheel */
#ifdef DEBUG
    ushort pb_hw_reloc;         /* Debug - it was a HW reloc request */
    char pad;                   /* pad to full long word */
#else
    char pad[3];                /* pad to full long word */
#endif

    struct part *pb_part;       /* ptr to part structure. Care must */
                                /* be taken when this is used since */
                                /* the parts structure can be moved */
                                /* by hd_config routines while the */
                                /* request is in flight */
    struct unique_id *pb_vgid;  /* volume group ID */

    /* used to dump the allocated pbuf at dump time */
    struct pbuf *pb_forw;       /* forward pointer */
    struct pbuf *pb_back;       /* backward pointer */
};



#define pb_addr pb.b_un.b_addr  /* too ugly in its raw form */

/* defines for pb_swretry */
#define MAXSWRETRY      3       /* maximum retries for relocation */
                                /* before declaring disk dead */

/*
 * values for b_work in pbuf struct (since real b_work value only used
 * in lbuf)
 */
#define FIX_READ_ERROR  1       /* fix a previous EMEDIA read error */
#define FIX_ESOFT       2       /* fix a read or write ESOFT error */
#define FIX_EMEDIA      3       /* fix a write EMEDIA error */

/* defines for pb_type */
#define SA_PVMISSING    1       /* PV missing type request */
#define SA_STALEPP      2       /* stale PP type request */
#define SA_FRESHPP      3       /* fresh PP type request */
#define SA_CONFIGOP     4       /* hd_config operation type request */

/*
 * defines to tell hd_bldpbuf what kind of pbuf to build
 *
 * These defines are not the only ones that tell hd_bldpbuf what to
 * build. Check the routine before changing/adding new defines here
 */
#define CATYPE_WRT      1       /* pbuf struct is a cache write type */

/*
 * defines for pb_bbop
 *
 * First set is used by the requests pbuf that is requesting the BB operation.
 * The second set is used in the bb_pbuf to control the action of the
 * actual reading and writing of the BB directory of the PV.
 */
#define BB_ADD          41      /* Add a new bad block entry to BB directory */
#define BB_UPDATE       42      /* Update a bad block entry to BB directory */
#define BB_DELETE       43      /* Delete a bad block entry to BB directory */
#define BB_RDDFCT       44      /* Reading a defective block */
#define BB_WTDFCT       45      /* Writing a defective block */
#define BB_SWRELO       46      /* Software relocation in progress */

#define RD_BBPRIM       70      /* Read the BB primary directory */
#define WT_UBBPRIM      71      /* Write BB prim dir with UPDATE */
#define WT_DBBPRIM      72      /* Rewrite BB prim dir 1st blk with UPDATE */
#define WT_UBBBACK      73      /* Write BB backup dir with UPDATE */
#define WT_DBBBACK      74      /* Rewrite BB back dir 1st blk with UPDATE */

/* defines for pb_berror; 0-63 (good) 64-127 (bad) */
#define BB_SUCCESS      0       /* BBdir updating worked */
#define BB_CRB          1       /* Reloc blkno was changed in this BB entry */
#define BB_ERROR        64      /* Bad Block directories were not updated */
#define BB_FULL         65      /* BBdir is full - no free bad blk entries */

/*
 * Volume group structure.
 *
 * Volume groups are implicitly open when any of their logical volumes are.
 */
#define MAXVGS          255     /* implementation limit on # VGs */
#define MAXLVS          256     /* implementation limit on # LVs */
#define MAXPVS          32      /* implementation limit on number */
                                /* physical volumes per vg */
#define CAHHSIZE        8       /* Number of mwc cache queues */
#define NBPI    (NBPB * sizeof (int))   /* Number of bits per int */
#define NBPL    (NBPB * sizeof (long))  /* Number of bits per long */

/* macros to set and clear the bits in the opn_pin array */
#define SETLVOPN(Vg,N) ((Vg)->opn_pin[(N)/NBPI] |= 1 << ((N)%NBPI))
#define CLRLVOPN(Vg,N) ((Vg)->opn_pin[(N)/NBPI] &= ~(1 << ((N)%NBPI)))
#define TSTLVOPN(Vg,N) ((Vg)->opn_pin[(N)/NBPI] & 1 << ((N)%NBPI))

/*
 * macros to set and clear the bits in the ca_pv_wrt field
 * NOTE TSTALLPVWRT will not work if max PVs per VG is greater than 32
 */
#define SETPVWRT(Vg,N)   ((Vg)->ca_pv_wrt[(N) / NBPL] |= 1 << ((N) % MAXPVS))
#define CLRPVWRT(Vg,N)   ((Vg)->ca_pv_wrt[(N) / NBPL] &= ~(1 << ((N) % MAXPVS)))
#define TSTPVWRT(Vg,N)   ((Vg)->ca_pv_wrt[(N) / NBPL] & (1 << ((N) % MAXPVS)))
#define TSTALLPVWRT(Vg)  ((Vg)->ca_pv_wrt[(MAXPVS - 1) / NBPL])
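
The open-LV bitmap macros above keep one bit per logical volume in an array of ints. As a minimal sketch of the same word-index and bit-index arithmetic, the following stand-alone functions operate on a hypothetical `struct vg_bits` that holds only the `opn_pin` array. The helper names, the use of `CHAR_BIT` for `NBPB`, and the unsigned element type are assumptions made for illustration; they are not taken from the listing.

```c
#include <assert.h>
#include <limits.h>

#define NBPB   CHAR_BIT                 /* assumption: bits per byte */
#define NBPI   (NBPB * sizeof(int))     /* number of bits per int */
#define MAXLVS 256                      /* limit on LVs, as in the listing */

/* Hypothetical stand-in for the volgrp structure: only the bitmap. */
struct vg_bits {
    unsigned int opn_pin[(MAXLVS + (NBPI - 1)) / NBPI];
};

/* Mark logical volume n as open (same arithmetic as SETLVOPN). */
static void set_lv_open(struct vg_bits *vg, unsigned n)
{
    assert(n < MAXLVS);
    vg->opn_pin[n / NBPI] |= 1u << (n % NBPI);
}

/* Mark logical volume n as closed (clear its bit). */
static void clr_lv_open(struct vg_bits *vg, unsigned n)
{
    assert(n < MAXLVS);
    vg->opn_pin[n / NBPI] &= ~(1u << (n % NBPI));
}

/* Return 1 if logical volume n is marked open, else 0. */
static int tst_lv_open(const struct vg_bits *vg, unsigned n)
{
    assert(n < MAXLVS);
    return (int)((vg->opn_pin[n / NBPI] >> (n % NBPI)) & 1u);
}
```

The array is sized by rounding `MAXLVS` up to a whole number of ints, which is what the `(MAXLVS + (NBPI - 1)) / NBPI` expression in the volume group structure does.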

/*
 * head of list of varied on volgrp structs in the system
 */
struct
{
    lock_t lock;                /* lock while manipulating list of VG structs */
    struct volgrp *ptr;         /* ptr to list of varied on VG structs */
} hd_vghead = {EVENT_NULL, NULL};



struct volgrp {
    lock_t vg_lock;             /* lock for all vg structures */
    short pad1;                 /* pad to long word boundary */
    short partshift;            /* log base 2 of part size in blks */
    short open_count;           /* count of open logical volumes */
    ushort flags;               /* VG flags field */
    ulong tot_io_cnt;           /* number of logical request to VG */
    struct lvol *lvols[MAXLVS]; /* logical volume struct array */
    struct pvol *pvols[MAXPVS]; /* physical volume struct array */
    long major_num;             /* major number of volume group */
    struct unique_id vg_id;     /* volume group id */
    struct volgrp *nextvg;      /* pointer to next volgrp structure */
                                /* Array of bits indicating open LVs */
                                /* A bit per LV */
    int opn_pin[(MAXLVS + (NBPI - 1))/NBPI];
    pid_t von_pid;              /* process ID of the varyon process */

    /* Following used in write consistency cache management */
    struct volgrp *nxtactvg;    /* pointer to next volgrp with */
                                /* write consistency activity */
    struct pv_wait *ca_freepvw; /* head of pv_wait free list */
    struct pv_wait *ca_pvwmem;  /* ptr to memory malloced for pvw */
                                /* free list */
    struct hd_queue ca_hld;     /* head/tail of cache hold queue */
    ulong ca_pv_wrt[(MAXPVS + (NBPL - 1)) / NBPL];
                                /* when bit set write cache to PV */
    char ca_inflt_cnt;          /* number of PV active writing cache */
    char ca_size;               /* number of entries in cache */
    ushort ca_pvwblked;         /* number of times the pv_wait free */
                                /* list has been empty */
    struct mwc_rec *mwc_rec;    /* ptr to part 1 of cache - disk rec */
    struct ca_mwc_mp *ca_part2; /* ptr to part 2 of cache - memory */
    struct ca_mwc_mp *ca_lst;   /* mru/lru cache list anchor */
    struct ca_mwc_mp *ca_hash[CAHHSIZE]; /* write consistency hash anchors */

    /* the following 2 variables are used to control a cache clean up */
    /* operation. */
    pid_t bcachwait;            /* list waiting at the beginning */
    pid_t ecachwait;            /* list waiting at the end */
    volatile int wait_cnt;      /* count of cleanup waiters */

    /* the following are used to control the VGSAs and the wheel */
    uchar quorum_cnt;           /* Number indicating quorum of SAs */
    uchar wheel_idx;            /* VGSA wheel index into pvols */
    ushort whl_seq_num;         /* VGSA memory image sequence number */
    struct pbuf *sa_act_lst;    /* head of list of pbufs that are */
                                /* actively on the VGSA wheel */
    struct pbuf *sa_hld_lst;    /* head of list of pbufs that are */
                                /* waiting to get on the VGSA wheel */
    struct vgsa_area *vgsa_ptr; /* ptr to in memory copy of VGSA */
    pid_t config_wait;          /* PID of process waiting in the */
                                /* hd_config routines to modify the */
                                /* memory version of the VGSA */
    struct buf sa_lbuf;         /* logical buf struct to use to wrt */
                                /* the VGSAs */
    struct pbuf sa_pbuf;        /* physical buf struct to use to wrt */
                                /* the VGSAs */
};
/*
 * Defines for flags field in volgrp structure
 */
#define VG_SYSMGMT      0x0002  /* VG is on for system management */
                                /* only commands */
#define VG_FORCEDOFF    0x0004  /* Should only be on when the VG was */
  /* forced varied off and there were LVs still open. Under this con- */
  /* dition the driver entry points can not be deleted from the device */
  /* switch table. Therefore the volgrp structure must be kept */
  /* around to handle any rogue operations on this VG. */
#define VG_OPENING      0x0008  /* VG is being varied on */
#define CA_INFLT        0x0010  /* The cache is being written or */
                                /* locked */
#define CA_VGACT        0x0020  /* This volgrp on mwc active list */
#define CA_HOLD         0x0040  /* Hold the cache in flight */
#define CA_FULL         0x0080  /* Cache is full - no free entries */
#define SA_WHL_ACT      0x0100  /* VGSA wheel is active */
#define SA_WHL_HLD      0x0200  /* VGSA wheel is on hold */
#define SA_WHL_WAIT     0x0400  /* config function is waiting for */
                                /* the wheel to stop */



/*
 * Logical volume structure.
 */
struct lvol {
    struct buf **work_Q;        /* work in progress hash table */
    short lv_status;            /* lv status: closed, closing, open */
    short lv_options;           /* logical dev options (see below) */
    short nparts;               /* num of part structures for this */
                                /* lv - base 1 */
    char i_sched;               /* initial scheduler policy state */
    char pad;                   /* padding so data word aligned */
    ulong nblocks;              /* LV length in blocks */
    struct part *parts[3];      /* partition arrays for each mirror */
    ulong tot_wrts;             /* total number of writes to LV */
    ulong tot_rds;              /* total number of reads to LV */

    /*
     * These fields of the lvol structure are read and/or written by
     * the bottom half of the LVDD; and therefore must be carefully
     * modified.
     */
    int complcnt;               /* completion count - used to quiesce */
    int waitlist;               /* event list for quiesce of LV */
};

/* lv status: */
#define LV_CLOSED       0       /* logical volume is closed */
#define LV_CLOSING      1       /* trying to close the LV */
#define LV_OPEN         2       /* logical volume is open */

/* scheduling policies: */
#define SCH_REGULAR     0       /* regular, non-mirrored LV */
#define SCH_SEQUENTIAL  1       /* sequential write, seq read */
#define SCH_PARALLEL    2       /* parallel write, read closest */
#define SCH_SEQWRTPARRD 3       /* sequential write, read closest */
#define SCH_PARWRTSEQRD 4       /* parallel write, seq read */

/* logical device options: */
#define LV_NOBBREL      0x0010  /* no bad block relocation */
#define LV_RDONLY       0x0020  /* read-only logical volume */
#define LV_DMPINPRG     0x0040  /* Dump in progress to this LV */
#define LV_DMPDEV       0x0080  /* This LV is a DUMP device */
                                /* i.e. DUMPINIT has been done */
#define LV_NOMWC        0x0100  /* no mirror write consistency */
                                /* checking */
#define LV_WRITEV       WRITEV  /* Write verify writes in LV */

/* work_Q hash algorithm - just a stub now */
#define HD_HASH(Lb) \
    (BLK2TRK((Lb)->b_blkno) & (WORKQ_SIZE-1))

/*
 * Partition structure.
 */
struct part {
    struct pvol *pvol;          /* containing physical volume */
    daddr_t start;              /* starting physical disk address */
    short sync_trk;             /* current LTG being resynced */
    char ppstale;               /* physical partition stale */
    char sync_msk;              /* current LTG sync mask */
};

/*
 * Physical partition state defines PP_ and structure defines.
 * The PP_STALE and PP_REDUCING bits could be combined into one but it
 * is easier to understand if they are not and a problem arises later.
 * The PP_RIP bit is only valid in the primary part structure.
 */
#define PP_STALE        0x01    /* Set when PP is stale */
#define PP_CHGING       0x02    /* Set when PP is stale but the */
                                /* VGSAs have not been completely */
                                /* updated yet */
#define PP_REDUCING     0x04    /* Set when PP is in the process */
                                /* of being removed (reduced out) */
#define PP_RIP          0x08    /* Set when a Resync is in progress */
                                /* When set "sync_trk" indicates */
                                /* the track being synced. If */
                                /* sync_trk not == -1 and PP_RIP */
                                /* not set sync_trk is next trk */
                                /* to be synced */
#define PP_SYNCERR      0x10    /* Set when error in a partition */
                                /* being resynced. Causes the */
                                /* partition to remain stale. */

#define NO_SYNCTRK      -1      /* The LP does not have a resync */
                                /* in progress */

/*
 * Physical volume structure.
 *
 * Contains defects directory hash anchor table. The defects
 * directory is hashed by track group within partition. Entries within
 * each congruence class are sorted in ascending block addresses.
 *
 * This scheme doesn't quite work, yet. The congruence classes need
 * to be aligned with logical track groups or partitions to guarantee
 * that all blocks of this request are checked. But physical addresses
 * need not be aligned on track group boundaries.
 */
#define HASHSIZE        64      /* number of defect hash classes */

struct defect_tbl {
    struct bad_blk *defects[HASHSIZE];  /* defect directory anchor */
};

struct pvol {
    dev_t dev;                  /* dev_t of physical device */
    daddr_t armpos;             /* last requested arm position */
    short xfcnt;                /* transfer count for this pv */
    short pvstate;              /* PV state */
    short pvnum;                /* LVM PV number 0-31 */
    short vg_num;               /* VG major number */
    struct file *fp;            /* file pointer from open of PV */
    char flags;                 /* place to hold flags */
    char pad;                   /* unused */
    short num_bbdir_ent;        /* current number of BBDir entries */
    daddr_t fst_usr_blk;        /* first available block on the PV */
                                /* for user data */
    daddr_t beg_relblk;         /* first blkno in reloc pool */
    daddr_t next_relblk;        /* blkno of next unused relocation */
                                /* block in reloc blk pool at end */
                                /* of PV */
    daddr_t max_relblk;         /* largest blkno avail for reloc */
    struct defect_tbl *defect_tbl;  /* pointer to defect table */
    struct hd_capvq ca_pv;      /* head/tail of queue of request */
                                /* waiting for cache write to */
                                /* complete */
    struct sa_pv_whl {          /* VGSA information for this PV */
        daddr_t lsn;            /* SA logical sector number - LV 0 */
        ushort sa_seq_num;      /* SA wheel sequence number */
        char nukesa;            /* flag set if SA to be deleted */
        char pad;               /* pad to full long word */
    } sa_area[2];               /* one for each possible SA on PV */
    struct pbuf pv_pbuf;        /* pbuf struct for writing cache */
};

/* defines for pvstate field */
#define PV_MISSING      1       /* PV cannot be accessed */
#define PV_RORELOC      2       /* No HW or SW relocation allowed */
                                /* only known bad blocks relocated */

/*
 * returns index into the bad block hash table for this block number
 */
#define BBHASH_IND(blkno) (BLK2TRK(blkno) & (HASHSIZE - 1))

/*
 * Macro to return defect directory congruence class pointer
 */
#define HASH_BAD(Pb,Bad_blkno) \
    ((Pb)->pb_pvol->defect_tbl->defects[BLK2TRK(Bad_blkno)&(HASHSIZE-1)])

/*
 * Used by the LVM dump device routines same as HASH_BAD but the first
 * argument is a pvol struct pointer
 */
#define HASH_BAD_DMP(Pvol,Blkno) \
    ((Pvol)->defect_tbl->defects[BLK2TRK(Blkno)&(HASHSIZE-1)])
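
The comments above describe a defect directory hashed by track group, with each congruence class kept sorted by ascending block address. The following is a minimal sketch of that organization, not the driver's code. The `BLK2TRK` definition (32 blocks per track group), the trimmed-down `bad_blk` structure, and the helper names `bbhash_ind`, `defect_insert` and `defect_find` are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define HASHSIZE 64                     /* number of defect hash classes */
#define BLK2TRK(blkno) ((blkno) >> 5)   /* assumption: 32 blocks per track group */

/* Trimmed-down, hypothetical version of the bad block entry. */
struct bad_blk {
    struct bad_blk *next;               /* next entry in congruence class */
    long blkno;                         /* bad physical disk address */
};

struct defect_tbl {
    struct bad_blk *defects[HASHSIZE];  /* defect directory anchors */
};

/* Index of the congruence class for a block number (cf. BBHASH_IND). */
static unsigned bbhash_ind(long blkno)
{
    return (unsigned)(BLK2TRK(blkno) & (HASHSIZE - 1));
}

/* Insert an entry, keeping its class sorted by ascending block address. */
static void defect_insert(struct defect_tbl *t, struct bad_blk *bb)
{
    struct bad_blk **pp;

    assert(t != NULL && bb != NULL);
    pp = &t->defects[bbhash_ind(bb->blkno)];
    while (*pp != NULL && (*pp)->blkno < bb->blkno)
        pp = &(*pp)->next;
    bb->next = *pp;
    *pp = bb;
}

/* Look up a block; the sorted order lets the scan stop early. */
static struct bad_blk *defect_find(const struct defect_tbl *t, long blkno)
{
    struct bad_blk *bb = t->defects[bbhash_ind(blkno)];

    while (bb != NULL && bb->blkno < blkno)
        bb = bb->next;
    return (bb != NULL && bb->blkno == blkno) ? bb : NULL;
}
```

Because `HASHSIZE` is a power of two, masking with `HASHSIZE - 1` is equivalent to taking the track number modulo 64, so track groups 3 and 67 land in the same class.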

/*
 * Bad block directory entry.
 */
struct bad_blk {                /* bad block directory entry */
    struct bad_blk *next;       /* next entry in congruence class */
    dev_t dev;                  /* containing physical device */
    daddr_t blkno;              /* bad physical disk address */
    unsigned status: 4;         /* relocation status (see below) */
    unsigned relblk: 28;        /* relocated physical disk address */
};

/* bad block relocation status values: */
#define REL_DONE        0       /* software relocation completed */
#define REL_PENDING     1       /* software relocation in progress */
#define REL_DEVICE      2       /* device (HW) relocation requested */
#define REL_CHAINED     3       /* relocation blk structure exists */
#define REL_DESIRED     8       /* relocation desired - hi order bit on */

/*
 * Macros for getting and releasing bad block structures from the
 * pool of bad_blk structures. They are linked together by their next pointers.
 * "hd_freebad" points to the head of bad_blk free list
 *
 * NOTE: Code must check if hd_freebad != null before calling
 *       the GET_BBLK macro.
 */
#define GET_BBLK(Bad) { \
    (Bad) = hd_freebad; \
    hd_freebad = hd_freebad->next; \
    hd_freebad_cnt--; \
}

#define REL_BBLK(Bad) { \
    (Bad)->next = hd_freebad; \
    hd_freebad = (Bad); \
    hd_freebad_cnt++; \
}
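The get/release pair above is a singly linked free list with a running count. The following is a minimal sketch of the same idea written as functions, under stated assumptions: `bb_node`, `bb_pool`, `bb_get` and `bb_release` are hypothetical names that do not appear in the listing, and the get routine returns NULL on an empty list rather than relying on the caller to pre-check as the NOTE requires.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a bad_blk entry; only the link field matters here. */
struct bb_node {
    struct bb_node *next;
    unsigned blkno;
};

/* Free-list head plus count, playing the role of hd_freebad / hd_freebad_cnt. */
struct bb_pool {
    struct bb_node *head;
    int count;
};

/* Pop the head entry. Unlike GET_BBLK, an empty list yields NULL. */
static struct bb_node *bb_get(struct bb_pool *pool)
{
    struct bb_node *n = pool->head;

    if (n == NULL)
        return NULL;
    pool->head = n->next;
    n->next = NULL;
    pool->count--;
    return n;
}

/* Push an entry back onto the head of the free list, as REL_BBLK does. */
static void bb_release(struct bb_pool *pool, struct bb_node *n)
{
    assert(n != NULL);
    n->next = pool->head;
    pool->head = n;
    pool->count++;
}
```

Because both operations touch only the head pointer, they run in constant time; in the driver they would still need to execute inside whatever critical section protects the list.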



/*
 * Macros for accessing these data structures
 */
#define VG_DEV2LV(Vg, Dev)        ((Vg)->lvols[minor(Dev)])
#define VG_DEV2PV(Vg, Pnum)       ((Vg)->pvols[(Pnum)])

#define BLK2PART(Pshift,Lbn)      ((ulong)(Lbn)>>(Pshift))
#define PART2BLK(Pshift,P_no)     ((P_no)<<(Pshift))
#define PARTITION(Lv,P_no,Mir)    ((Lv)->parts[(Mir)] + (P_no))



/*
 * Mirror bit definitions
 */
#define PRIMARY_MIRROR    001   /* primary mirror mask */
#define SECONDARY_MIRROR  002   /* secondary mirror mask */
#define TERTIARY_MIRROR   004   /* tertiary mirror mask */
#define ALL_MIRRORS       007   /* mask of all mirror bits */

/* macro to extract mirror avoidance mask from ext parameter */
#define X_AVOID(Ext)  ( ((Ext) >> AVOID_SHFT) & ALL_MIRRORS )



/*
 * Macros to select mirrors using avoidance masks:
 *
 * FIRST_MIRROR returns first unmasked mirror (0 to 2); 3 if all masked
 * FIRST_MASK   returns first masked mirror (0 to 2); 3 if none masked
 * MIRROR_COUNT returns number of unmasked mirrors (0 to 3)
 * MIRROR_MASK  returns a mask to avoid a specific mirror (1, 2, 4)
 * MIRROR_EXIST returns a mask for non-existent mirrors (0, 4, 6, or 7)
 */
#define FIRST_MIRROR(Mask)      ((0x30102010 >> ((Mask) << 2))&0x0f)
#define FIRST_MASK(Mask)        ((0x01020103 >> ((Mask) << 2))&0x0f)
#define MIRROR_COUNT(Mask)      ((0x01121223 >> ((Mask) << 2))&0x0f)
#define MIRROR_EXIST(Nmirrors)  ((0x00000467 >> ((Nmirrors) << 2))&0x0f)
#define MIRROR_MASK(Mirror)     (1 << (Mirror))
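Each of the selection macros above packs an eight-entry lookup table into one 32-bit constant: the 3-bit mask, multiplied by 4, selects one hex digit. The sketch below cross-checks two of the constants against plain loop versions. The loop helpers and `check_mirror_tables` are hypothetical names introduced only for this illustration, and the `u` suffix on the constants is an addition to keep the shifts unsigned.

```c
#include <assert.h>

/* Hex digit i of each constant holds the answer for avoidance mask i. */
#define FIRST_MIRROR(Mask)  ((0x30102010u >> ((Mask) << 2)) & 0x0f)
#define MIRROR_COUNT(Mask)  ((0x01121223u >> ((Mask) << 2)) & 0x0f)

/* Reference version: lowest-numbered mirror whose bit is clear, 3 if none. */
static unsigned first_mirror_loop(unsigned mask)
{
    unsigned m;

    for (m = 0; m < 3; m++)
        if (!(mask & (1u << m)))
            return m;
    return 3;
}

/* Reference version: number of clear bits among the three mirror bits. */
static unsigned mirror_count_loop(unsigned mask)
{
    unsigned m, n = 0;

    for (m = 0; m < 3; m++)
        if (!(mask & (1u << m)))
            n++;
    return n;
}

/* Walk all eight masks and confirm the packed tables match the loops. */
static void check_mirror_tables(void)
{
    unsigned mask;

    for (mask = 0; mask < 8; mask++) {
        assert(FIRST_MIRROR(mask) == first_mirror_loop(mask));
        assert(MIRROR_COUNT(mask) == mirror_count_loop(mask));
    }
}
```

The packed form trades readability for a branch-free lookup, which is a reasonable choice for code that runs on every mirrored request.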

/*
 * DBSIZE and DBSHIFT were originally UBSIZE and UBSHIFT from param.h.
 * They were renamed and moved to here to more closely resemble a disk
 * block and not a user block size.
 */
#define DBSIZE      512     /* Disk block size in bytes */
#define DBSHIFT     9       /* log 2 of DBSIZE */

/*
 * LVPAGESIZE and LVPGSHIFT were originally PAGESIZE and PGSHIFT from param.h.
 * They were renamed and moved to here to isolate LVM from the changeable
 * system parameters that would have undesirable effects on LVM functionality.
 */
#define LVPAGESIZE  4096    /* Page size in bytes */
#define LVPGSHIFT   12      /* log 2 of LVPAGESIZE */

#define BPPG        (LVPAGESIZE/DBSIZE)     /* blocks per page */
#define BPPGSHIFT   (LVPGSHIFT-DBSHIFT)     /* log 2 of BPPG */
#define PGPTRK      32                      /* pages per logical track group */
#define TRKSHIFT    5                       /* log base 2 of PGPTRK */
#define LTGSHIFT    (TRKSHIFT+BPPGSHIFT)    /* logical track group log base 2 */
#define BYTEPTRK    PGPTRK*LVPAGESIZE       /* bytes per logical track group */
#define BLKPTRK     PGPTRK*BPPG             /* blocks per logical track group */

#define SIGNED_SHIFTMSK 0x80000000  /* signed mask for shifting to */
                                    /* get page affected mask */

#define BLK2BYTE(Nblocks)   ((unsigned)(Nblocks) << (DBSHIFT))
#define BYTE2BLK(Nbytes)    ((unsigned)(Nbytes) >> (DBSHIFT))
#define BLK2PG(Blk)         ((unsigned)(Blk) >> BPPGSHIFT)
#define PG2BLK(Pageno)      ((Pageno) << (LVPGSHIFT-DBSHIFT))
#define BLK2TRK(Blk)        ((unsigned)(Blk) >> (TRKSHIFT + BPPGSHIFT))
#define TRK2BLK(T_no)       ((unsigned)(T_no) << (TRKSHIFT + BPPGSHIFT))
#define PG2TRK(Pageno)      ((unsigned)(Pageno) >> (TRKSHIFT))



/* LTG per partition */
#define TRKPPART(Pshift)  ((unsigned)(1 << (Pshift - LTGSHIFT)))
/* LTG in the partition */
#define TRK_IN_PART(Pshift, Blk)  ( BLK2TRK(Blk) & (TRKPPART(Pshift) - 1) )
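With the shift values above, a logical track group (LTG) spans 32 pages of 8 blocks each, that is 256 blocks or 128 KB. The sketch below works one block number through the conversions. The partition shift of 13 (8192 blocks, 4 MB) is an assumed example value, not one taken from the listing, and `blk_in_trk` and `check_ltg_math` are hypothetical helpers.

```c
#include <assert.h>

#define DBSHIFT    9                         /* 512-byte blocks */
#define LVPGSHIFT  12                        /* 4096-byte pages */
#define BPPGSHIFT  (LVPGSHIFT - DBSHIFT)     /* 8 blocks per page */
#define TRKSHIFT   5                         /* 32 pages per LTG */
#define LTGSHIFT   (TRKSHIFT + BPPGSHIFT)    /* 256 blocks per LTG */

#define BLK2TRK(Blk)              ((unsigned)(Blk) >> LTGSHIFT)
#define TRK2BLK(T_no)             ((unsigned)(T_no) << LTGSHIFT)
#define TRKPPART(Pshift)          ((unsigned)(1 << ((Pshift) - LTGSHIFT)))
#define TRK_IN_PART(Pshift, Blk)  (BLK2TRK(Blk) & (TRKPPART(Pshift) - 1))

/* Hypothetical helper: offset of a block inside its own LTG. */
static unsigned blk_in_trk(unsigned blk)
{
    return blk & ((1u << LTGSHIFT) - 1);
}

/* Work block 10000 through the conversions, assuming Pshift = 13. */
static void check_ltg_math(void)
{
    assert(TRK2BLK(1) == 256);               /* blocks per LTG */
    assert(BLK2TRK(10000) == 39);            /* 10000 / 256 */
    assert(TRK2BLK(39) == 9984);             /* first block of that LTG */
    assert(blk_in_trk(10000) == 16);         /* 10000 - 9984 */
    assert(TRKPPART(13) == 32);              /* LTGs per 4 MB partition */
    assert(TRK_IN_PART(13, 10000) == 7);     /* 39 mod 32 */
}
```

All sizes are powers of two, so every conversion reduces to a shift or a mask and no division is needed on the I/O path.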



/* defines for top half of LVDD */
#define LVDD_HFREE_BB   30          /* high water mark for kernel bad_blk struct */
#define LVDD_LFREE_BB   15          /* low water mark for kernel bad_blk struct */
#define WORKQ_SIZE      64          /* size of LVs work in progress queue */
#define PBSUBPOOLSIZE   16          /* size of pbuf subpool alloc'd by PVs */
#define HD_ALIGN        (uint)0     /* align characteristics for alloc'd memory */
#define FULL_WORDMASK   3           /* mask for full word (log base 2) */
#define BUFCNT          3           /* parameter sent to uphysio for # buf */
                                    /* structs to allocate */

#define NOMIRROR        0           /* no mirrors */
#define PRIMMIRROR      0           /* primary mirror */
#define SINGMIRROR      1           /* one mirror */
#define DOUBMIRROR      2           /* two mirrors */

#define MAXNUMPARTS     3           /* maximum number of parts in a logical part */
#define PVNUMVGDAS      2           /* max number of VGDA/VGSAs on a PV */

/* return codes for LVDD top 1/2 */
#define LVDD_SUCCESS    0           /* general success code */
#define LVDD_ERROR      -1          /* general error code */
#define LVDD_NOALLOC    -200        /* hd_init: not able to allocate pool of bufs */

#endif /* _H_DASD */

/* HD.H */

#ifndef _H_HD
#define _H_HD

/*
 * COMPONENT_NAME: (SYSXLVM) Logical Volume Manager Device Driver - hd.h
 *
 * © COPYRIGHT International Business Machines Corp. 1988, 1990
 * All Rights Reserved
 */

#include <sys/errids.h>

/*
 * LVDD internal macros and extern statically declared variables.
 */



/* LVM internal defines: */
#define FAILURE     0   /* must be logic FALSE for 'if' tests */
#define SUCCESS     1   /* must be logic TRUE for 'if' tests */
#define MAXGRABLV   16  /* Max number of LVs to grab pbuf structs */
#define MAXSYSVG    3   /* Max number of VGs to grab pbuf structs */
#define CAHEAD      1   /* move cache entry to head of use list */
#define CATAIL      2   /* move cache entry to tail of use list */
#define CA_MISS     0   /* MWC cache miss */
#define CA_HIT      1   /* MWC cache hit */
#define CA_LBHOLD   2   /* The logical request should hold */



/*
 * Following defines are used to communicate with the kernel process
 */
#define LVDD_KP_TERM    0x80000000  /* Terminate the kernel process */
#define LVDD_KP_BADBLK  0x40000000  /* Need more bad_blk structs */
#define LVDD_KP_ACTMSK  0xC0000000  /* Mask of all events */

/*
 * Following defines are used in the b_options of the logical buf struct.
 * They should be reserved in lvdd.h in relationship to the ext parameters
 */
#define REQ_IN_CACH     0x40000000  /* When set in the lbuf b_options */
                                    /* the request is in the mirror */
                                    /* write consistency cache */
#define REQ_VGSA        0x20000000  /* When set in the lbuf b_options */
                                    /* it means this is a VGSA write */
                                    /* and to use the special sa_pbuf */
                                    /* in the volgrp structure */

/*
 * The following variables are only used in the kernel and therefore are
 * only included if the _KERNEL variable is defined.
 */
#ifdef _KERNEL
#include <sys/syspest.h>

/*
 * Set up a debug level if debug turned on
 */
#ifdef DEBUG
#ifdef LVDD_PHYS
BUGVDEF(debuglvl, 0)
#else
BUGXDEF(debuglvl)
#endif
#endif

/*
 * pending queue
 *
 * This is the primary data structure for passing work from
 * the strategy routines (see hd_strat.c) to the scheduler
 * (see hd_sched.c) via the mirror write consistency logic.
 * From this queue the request will go to one of three other
 * queues.
 *
 * 1. cache hold queue - If the request involves mirrors
 *    and the write consistency cache is in flight,
 *    i.e. being written to PVs.
 *
 * 2. cache PV queue - If the request must wait for the
 *    write consistency cache to be written to the PV.
 *
 * 3. schedule queue - Requests are scheduled from this
 *    queue.
 *
 * This queue is only changed within a device driver critical section.
 */
#ifdef LVDD_PHYS
struct hd_queue pending_Q;
#else
extern struct hd_queue pending_Q;
#endif

/*
 * ready queue - physical requests that are ready to start
 *
 * This queue is only valid within a single critical section.
 * It really contains a list of pbufs, but only the imbedded
 * buf struct is of interest at this point. Since the pointers
 * are of type (struct buf *) it is convenient that the queue be
 * declared similarly.
 */
#ifdef LVDD_PHYS
struct buf *ready_Q = NULL;
#else
extern struct buf *ready_Q;
#endif

/*
 * Chain of free and available pbuf structs.
 */
#ifdef LVDD_PHYS
struct pbuf *hd_freebuf = NULL;
#else
extern struct pbuf *hd_freebuf;
#endif

/*
 * Chain of pbuf structs currently allocated and pinned for LVDD use.
 * Only used at dump time and by crash to find them.
 */
#ifdef LVDD_PHYS
struct pbuf *hd_dmpbuf = NULL;
#else
extern struct pbuf *hd_dmpbuf;
#endif

/*
 * Chain and count of free and available bad_blk structs.
 * The first open of a VG, really the first open of an LV, will cause
 * LVDD_HFREE_BB (currently 30) bad_blk structs to be allocated and
 * chained here. After that when the count gets to LVDD_LFREE_BB (low
 * water mark, currently 15) the kernel process will be kicked to go
 * get more up to LVDD_HFREE_BB (high water mark) more.
 *
 * *NOTE* hd_freebad_lk is a lock mechanism to keep the top half of the
 *        driver and the kernel process from colliding. This would only
 *        happen if the last request before the last LV closed received
 *        an ESOFT or EMEDIA (and request was a write) and the getting of
 *        a bad_blk struct caused the count to go below the low water
 *        mark. This would result in the kproc trying to put more
 *        structures on the list while hd_close via hd_frefrebb would
 *        be removing them.
 */
#ifdef LVDD_PHYS
int hd_freebad_lk = LOCK_AVAIL;
struct bad_blk *hd_freebad = NULL;
int hd_freebad_cnt = 0;
#else
int hd_freebad_lk;
extern struct bad_blk *hd_freebad;
extern int hd_freebad_cnt;
#endif

/*
 * Chain of volgrp structs that have write consistency caches that need
 * to be written to PVs. This chain is used so all incoming requests
 * can be scanned before putting the write consistency cache in flight.
 * Once in flight the cache is locked out and any new requests will have
 * to wait for all cache writes to finish.
 */
#ifdef LVDD_PHYS
struct volgrp *hd_vg_mwc = NULL;
#else
extern struct volgrp *hd_vg_mwc;
#endif

/*
 * The following arrays are used to allocate mirror write consistency
 * caches in a group of 8 per page. This is due to the way the hide
 * mechanism works only on page quantities. These two arrays should be
 * treated as being in lock step. The lock, hd_ca_lock, is used to
 * ensure only one process is playing with the arrays at any one time.
 */
#define VGS_CA ((MAXVGS+(NBPB-1))/NBPB)
#ifdef LVDD_PHYS
lock_t hd_ca_lock = LOCK_AVAIL;         /* lock for cache arrays */
char ca_alloced[VGS_CA];                /* bit per VG with cache allocated */
struct mwc_rec *ca_grp_ptr[VGS_CA];     /* 1 for each 8 VGs */
#else
extern lock_t hd_ca_lock;
extern char ca_alloced[];
extern struct mwc_rec *ca_grp_ptr[];
#endif

/*
 * The following variables are used to control the number of pbuf
 * structures allocated for LVM use. It is based on the number of
 * PVs in varied on VGs. The first PV gets 64 structures and each
 * PV thereafter gets 16 more. The number is reduced only when a
 * VG goes inactive, i.e. all its LVs are closed.
 */
#ifdef LVDD_PHYS
int hd_pbuf_cnt = 0;                    /* Total Number of pbufs allocated */
int hd_pbuf_grab = PBSUBPOOLSIZE;       /* Number of pbuf structs to allocate */
                                        /* for each active PV on the system */
int hd_pbuf_min = PBSUBPOOLSIZE * 4;    /* Number of pbuf to allocate for the */
                                        /* first PV on the system */
int hd_vgs_opn = 0;                     /* Number of VGs opened */
int hd_lvs_opn = 0;                     /* Number of LVs opened */
int hd_pvs_opn = 0;                     /* Number of PVs in varied on VGs */
int hd_pbuf_inuse = 0;                  /* Number of pbufs currently in use */
int hd_pbuf_maxuse = 0;                 /* Maximum number of pbufs in use during */
                                        /* this boot */
#else
extern int hd_pbuf_cnt;
extern int hd_pbuf_grab;
extern int hd_pbuf_min;
extern int hd_vgs_opn;
extern int hd_lvs_opn;
extern int hd_pvs_opn;
extern int hd_pbuf_inuse;
extern int hd_pbuf_maxuse;
#endif
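The sizing rule in the comment above (64 pbufs for the first PV, 16 for each additional one) can be written as a one-line calculation. The sketch below is only an illustration of that arithmetic: `pbufs_for_pvs` is a hypothetical function, not a routine from the driver, and it ignores the rule that the pool shrinks only when a VG goes inactive.

```c
#include <assert.h>

#define PBSUBPOOLSIZE 16    /* size of pbuf subpool alloc'd by PVs */

/*
 * Hypothetical helper: total pbufs the sizing rule yields for a given
 * number of PVs in varied-on VGs. The first PV gets PBSUBPOOLSIZE * 4
 * (the value of hd_pbuf_min), each later PV gets PBSUBPOOLSIZE more
 * (the value of hd_pbuf_grab).
 */
static int pbufs_for_pvs(int npvs)
{
    assert(npvs >= 0);
    if (npvs == 0)
        return 0;
    return PBSUBPOOLSIZE * 4 + (npvs - 1) * PBSUBPOOLSIZE;
}
```

For example, three PVs would give 64 + 2 * 16 = 96 structures under this reading of the comment.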



/*
 * The following are used to update the bad block directory on a disk
 */
#ifdef LVDD_PHYS
struct pbuf *bb_pbuf;           /* ptr to pbuf reserved for BB dir updating */
struct hd_queue bb_hld;         /* holding Q used when there is a BB */
                                /* directory update in progress */
#else
extern struct pbuf *bb_pbuf;
extern struct hd_queue bb_hld;
#endif

/*
 * The following variables are used to communicate between the LVDD
 * and the kernel process.
 */
#ifdef LVDD_PHYS
pid_t hd_kpid = 0;              /* PID of the kernel process */
#else
extern pid_t hd_kpid;
#endif

/*
 * The following variables are used in an attempt to keep some information
 * around about the performance and potential bottlenecks in the driver.
 * Currently these must be looked at with crash or the kernel debugger.
 */
#ifdef LVDD_PHYS
ulong hd_pendqblked = 0;        /* How many times the scheduling queue */
                                /* (pending_Q) has been blocked due to no */
                                /* pbufs being available. */
#else
extern ulong hd_pendqblked;
#endif



/*
 * The following are used to log error messages by LVDD. The de_data
 * is defined as a general 16 byte array, BUT, its actual use is
 * totally dependent on the error type.
 */
#define RESRC_NAME "LVDD"       /* Resource name for error logging */
struct hd_errlog_ent {          /* Error log entry structure */
    struct err_rec0 id;
    char de_data[16];
};

/* macros to allocate and free pbuf structures */
#define GET_PBUF(PB) { \
    (PB) = hd_freebuf; \
    hd_freebuf = (struct pbuf *) hd_freebuf->pb.av_forw; \
    hd_pbuf_inuse++; \
    if( hd_pbuf_inuse > hd_pbuf_maxuse ) \
        hd_pbuf_maxuse = hd_pbuf_inuse; \
}

#define REL_PBUF(PB) { \
    (PB)->pb.av_forw = (struct buf *)hd_freebuf; \
    hd_freebuf = (PB); \
    hd_pbuf_inuse--; \
}

/* macros to allocate and free pv_wait structures */
#define GET_PVWAIT(Pvw, Vg) { \
    (Pvw) = (Vg)->ca_freepvw; \
    (Vg)->ca_freepvw = (Pvw)->nxt_pv_wait; \
}

#define REL_PVWAIT(Pvw, Vg) { \
    (Pvw)->nxt_pv_wait = (Vg)->ca_freepvw; \
    (Vg)->ca_freepvw = (Pvw); \
}

#define TST_PVWAIT(Vg)  ((Vg)->ca_freepvw == NULL)

/*
 * Macro to put volgrp ptr at head of the list of VGs waiting to start
 * MWC cache writes
 */
#define CA_VG_WRT(Vg) { \
    if( !((Vg)->flags & CA_VGACT) ) { \
        (Vg)->nxtactvg = hd_vg_mwc; \
        hd_vg_mwc = (Vg); \
        (Vg)->flags |= CA_VGACT; \
    } \
}

/*
 * Macro to determine if a physical request should be returned to
 * the scheduling layer or continue (resume).
 */
#define PB_CONT(Pb) { \
    if( ((Pb)->pb_addr == ((Pb)->pb_lbuf->b_baddr + (Pb)->pb_lbuf->b_bcount)) || \
        ((Pb)->pb.b_flags & B_ERROR) ) \
        HD_SCHED((Pb)); \
    else \
        hd_resume((Pb)); \
}

/*
 * HD_SCHED - invoke scheduler policy routine for this request
 * For physical requests it invokes the physical operation end policy.
 */
#define HD_SCHED(Pb)    (*(Pb)->pb_sched)(Pb)

/* define for b_error value (only used by LVDD) */
#define ELBBLOCKED  255         /* this logical request is blocked by */
                                /* another one in progress */

#endif /* _KERNEL */

/*
 * Write consistency cache structures and macros
 */

/* cache hash algorithms - returns index into cache hash table */
#define CA_HASH(Lb)     (BLK2TRK((Lb)->b_blkno) & (CAHHSIZE-1))
#define CA_THASH(Trk)   ((Trk) & (CAHHSIZE-1))

/*
 * This structure will generally be referred to as part 2 of the cache
 */
struct ca_mwc_mp {  /* cache mirror write consistency memory only part */
    struct ca_mwc_mp *hq_next;  /* ptr to next hash queue entry */
    char state;                 /* State of entry */
    char pad1;                  /* Pad to word */
    ushort iocnt;               /* Non-zero - io active to LTG */
    struct ca_mwc_dp *part1;    /* Ptr to part 1 entry - ca_mwc_dp */
    struct ca_mwc_mp *next;     /* Next memory part struct */
    struct ca_mwc_mp *prev;     /* Previous memory part struct */
};

/* ca_mwc_mp state defines */
#define CANOCHG 0x00    /* Cache entry has NOT changed since last */
                        /* cache write operation, but is on a hash */
                        /* queue somewhere */
#define CACHG   0x01    /* Cache entry has changed since last cache */
                        /* write operation */
#define CACLEAN 0x02    /* Cache entry has not been used since last */
                        /* clean up operation */

/*
 * This structure will generally be referred to as part 1 of the cache
 * In order to stay long word aligned this structure has a 2 byte pad.
 * This reduces the number of cache entries available in the cache.
 */
struct ca_mwc_dp {  /* cache mirror write consistency disk part */
    ulong lv_ltg;               /* LV logical track group */
    ushort lv_minor;            /* LV minor number */
    short pad;
};

#define MAX_CA_ENT  62  /* Max number that will fit in block */

/*
 * This structure must be maintained to be 1 block in length (512 bytes).
 * This also implies the maximum number of write consistency cache entries.
 */
struct mwc_rec {    /* mirror write consistency disk record */
    struct timestruc_t b_tmstamp;           /* Time stamp at beginning of block */
    struct ca_mwc_dp ca_p1[MAX_CA_ENT];     /* Reserve 62 part 1 structures */
    struct timestruc_t e_tmstamp;           /* Time stamp at end of block */
};
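The figure of 62 entries follows from the one-block constraint: a 512-byte record minus two time stamps, divided by the size of one disk-part entry. The sketch below checks that arithmetic with fixed-width types. It assumes an 8-byte time stamp (two 32-bit fields), which is not stated in the listing, and all type and function names ending in `_sketch`, as well as `mwc_max_entries` and `check_mwc_layout`, are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define DBSIZE      512     /* disk block size in bytes */
#define MAX_CA_ENT  62      /* entries per record, from the listing */

/* Assumed layout of the time stamp: seconds plus nanoseconds, 8 bytes. */
struct timestamp_sketch {
    int32_t tv_sec;
    int32_t tv_nsec;
};

/* Disk part entry: 4 + 2 + 2 bytes, the last 2 being the alignment pad. */
struct ca_mwc_dp_sketch {
    uint32_t lv_ltg;
    uint16_t lv_minor;
    int16_t  pad;
};

/* Whole record: leading stamp, entry array, trailing stamp. */
struct mwc_rec_sketch {
    struct timestamp_sketch b_tmstamp;
    struct ca_mwc_dp_sketch ca_p1[MAX_CA_ENT];
    struct timestamp_sketch e_tmstamp;
};

/* Entries that fit once both time stamps are subtracted from the block. */
static int mwc_max_entries(int block, int stamp, int entry)
{
    return (block - 2 * stamp) / entry;
}

/* Confirm the record fills exactly one block under the assumed sizes. */
static void check_mwc_layout(void)
{
    assert(sizeof(struct ca_mwc_dp_sketch) == 8);
    assert(sizeof(struct mwc_rec_sketch) == DBSIZE);
    assert(mwc_max_entries(DBSIZE, 8, 8) == MAX_CA_ENT);
}
```

Without the 2-byte pad each entry would need only 6 bytes, which is why the comment notes that the pad reduces the number of entries available.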

/*
 * This structure is used by the MWCM. It is hung on the PV cache write
 * queues to indicate which lbufs are waiting on any particular PV. The
 * define controls how much memory to allocate to hold these structures.
 * The algorithm is 3 * CA_MULT * cache size * size of structure.
 */
#define CA_MULT 4   /* pv_wait * cache size multiplier */

struct pv_wait {
    struct pv_wait *nxt_pv_wait;    /* next pv_wait structure on chain */
    struct buf *lb_wait;            /* ptr to lbuf waiting for cache */
};



/*
 * LVM function declarations - arranged by module in order by how they occur
 * in said module.
 */
#ifdef _KERNEL
#ifndef _NO_PROTO

/* hd_mircach.c */

extern int hd_ca_ckcach (
    register struct buf *lb,            /* current logical buf struct */
    register struct volgrp *vg,         /* ptr to volgrp structure */
    register struct lvol *lv);          /* ptr to lvol structure */

extern void hd_ca_use (
    register struct volgrp *vg,         /* ptr to volgrp structure */
    register struct ca_mwc_mp *ca_ent,  /* cache entry pointer */
    register int h_t);                  /* head/tail flag */

extern struct ca_mwc_mp *hd_ca_new (
    register struct volgrp *vg);        /* ptr to volgrp structure */

extern void hd_ca_wrt (void);

extern void hd_ca_wend (
    register struct pbuf *pb);          /* Address of pbuf completed */

extern void hd_ca_sked (
    register struct volgrp *vg,         /* ptr to volgrp structure */
    register struct pvol *pvol);        /* pvol ptr for this PV */

extern struct ca_mwc_mp *hd_ca_fnd (
    register struct volgrp *vg,         /* ptr to volgrp structure */
    register struct buf *lb);           /* ptr to lbuf to find the entry */
                                        /* for */

extern void hd_ca_clnup (
    register struct volgrp *vg);        /* ptr to volgrp structure */

extern void hd_ca_qunlk (
    register struct volgrp *vg,         /* ptr to volgrp structure */
    register struct ca_mwc_mp *ca_ent); /* ptr to entry to unlink */

extern int hd_ca_pvque (
    register struct buf *lb,            /* current logical buf struct */
    register struct volgrp *vg,         /* ptr to volgrp structure */
    register struct lvol *lv);          /* ptr to lvol structure */

extern void hd_ca_end (
    register struct pbuf *pb);          /* physical device buf struct */

extern void hd_ca_term (
    register struct buf *lb);           /* current logical buf struct */

extern void hd_ca_mvhld (
    register struct volgrp *vg);        /* ptr to volgrp structure */

/* hd_dump.c */

extern int hd_dump (
    dev_t dev,                          /* major/minor of LV */
    struct uio *uiop,                   /* ptr to uio struct describing operation */
    int cmd,                            /* dump command */
    char *arg,                          /* cmd dependent - ptr to dmp_query struct */
    int chan,                           /* not used */
    int ext);                           /* not used */

extern int hd_dmpxlate (
    register dev_t dev,                 /* major/minor of LV */
    register struct uio *luiop,         /* ptr to logical uio structure */
    register struct volgrp *vg);        /* ptr to VG from device switch table */

/* hd_top.c */

extern int hd_open (
    dev_t dev,          /* device number major,minor of LV to be opened */
    int flags,          /* read/write flag */
    int chan,           /* not used */
    int ext);           /* not used */

extern int hd_allocpbuf (void);

extern void hd_pbufdmpq (
    register struct pbuf *pb,           /* new pbuf for chain */
    register struct pbuf **qq);         /* Ptr to queue anchor */

extern void hd_openbkout (
    int bopoint,                        /* point to start backing out */
    struct volgrp *vg);                 /* struct volgrp ptr */

extern void hd_backout (
    int bopoint,                        /* point where error occurred & need to */
                                        /* backout all structures pinned before */
                                        /* this point */
    struct lvol *lv,                    /* ptr to lvol to backout */
    struct volgrp *vg);                 /* struct volgrp ptr */

extern int hd_close (
    dev_t dev,          /* device number major,minor of LV to be closed */
    int chan,           /* not used */
    int ext);           /* not used */

extern void hd_vgcleanup (
    struct volgrp *vg);                 /* struct volgrp ptr */

extern void hd_frefrebb (void);

extern int hd_allocbblk (void);

extern int hd_read (
    dev_t dev,          /* num major,minor of LV to be read */
    struct uio *uiop,   /* pointer to uio structure that specifies */
                        /* location & length of caller's data buffer */
    int chan,           /* not used */
    int ext);           /* extension parameters */

extern int hd_write (
    dev_t dev,          /* num major,minor of LV to be written */
    struct uio *uiop,   /* pointer to uio structure that specifies */
                        /* location & length of caller's data buffer */
    int chan,           /* not used */
    int ext);           /* extension parameters */

extern int hd_mincnt (
    struct buf *bp,     /* ptr to pbuf struct to be checked */
    void *minparms);    /* ptr to ext value sent to uphysio by */
                        /* hd_read/hd_write. */

extern int hd_ioctl (
    dev_t dev,          /* device number major,minor of LV to be opened */
    int cmd,            /* specific ioctl command to be performed */
    int arg,            /* addr of parameter blk for the specific cmd */
    int mode,           /* request origination */
    int chan,           /* not used */
    int ext);           /* not used */

extern struct mwc_rec *hd_alloca (void);

extern void hd_dealloca (
    register struct mwc_rec *ca_ptr);   /* ptr to cache to free */

extern void hd_nodumpvg (
    struct volgrp *);

/* hd_phys.c */

extern void hd_begin (
    register struct pbuf *pb,           /* physical device buf struct */
    register struct volgrp *vg);        /* pointer to volgrp struct */

extern void hd_end (
    register struct pbuf *pb);          /* physical device buf struct */

extern void hd_resume (
    register struct pbuf *pb);          /* physical device buf struct */

extern void hd_ready (
    register struct pbuf *pb);          /* physical request buf */

extern void hd_start (void);

extern void hd_gettime (
    register struct timestruc_t *o_time);   /* old time */

/* hd_bbrel.c */

extern int hd_chkblk (
    register struct pbuf *pb);          /* physical device buf struct */

extern void hd_bbend (
    register struct pbuf *pb);          /* physical device buf struct */

extern void hd_baddone (
    register struct pbuf *pb);          /* physical request to process */

extern void hd_badblk (
    register struct pbuf *pb);          /* physical request to process */

extern void hd_swreloc (
    register struct pbuf *pb);          /* physical request to process */

extern daddr_t hd_assignalt (
    register struct pbuf *pb);          /* physical request to process */

extern struct bad_blk *hd_fndbbrel (
    register struct pbuf *pb);          /* physical request to process */

extern void hd_nqbblk (
    register struct pbuf *pb);          /* physical request to process */

extern void hd_dqbblk (
    register struct pbuf *pb,           /* physical request to process */
    register daddr_t blkno);

/* hd_sched.c */

extern void hd_schedule (void);

extern int hd_avoid (
    register struct buf *lb,            /* logical request buf */
    register struct volgrp *vg);        /* VG volgrp ptr */

extern void hd_resyncpp (
    register struct pbuf *pb);          /* physical device buf struct */

extern void hd_freshpp (
    register struct volgrp *vg,         /* pointer to volgrp struct */
    register struct pbuf *pb);          /* physical request buf */

extern void hd_mirread (
    register struct pbuf *pb);          /* physical device buf struct */

extern void hd_fixup (
    register struct pbuf *pb);          /* physical device buf struct */

extern void hd_stalepp (
    register struct volgrp *vg,         /* pointer to volgrp struct */
    register struct pbuf *pb);          /* physical device buf struct */

extern void hd_staleppe (
    register struct pbuf *pb);          /* physical request buf */

extern void hd_xlate (
    register struct pbuf *pb,           /* physical request buf */
    register int mirror,                /* mirror number */
    register struct volgrp *vg);        /* VG volgrp ptr */

extern int hd_regular (
    register struct buf *lb,            /* logical request buf */
    register struct volgrp *vg);        /* volume group structure */

extern void hd_finished (
    register struct pbuf *pb);          /* physical device buf struct */

extern int hd_sequential (
    register struct buf *lb,            /* logical request buf */
    register struct volgrp *vg);        /* volume group structure */

extern void hd_seqnext (
    register struct pbuf *pb,           /* physical request buf */
    register struct volgrp *vg);        /* VG volgrp pointer */

extern void hd_seqwrite (
    register struct pbuf *pb);          /* physical device buf struct */

extern int hd_parallel (
    register struct buf *lb,            /* logical request buf */
    register struct volgrp *vg);        /* volume group structure */

extern void hd_freeall (
    register struct pbuf *q);           /* write request queue */

extern void hd_append (
    register struct pbuf *pb,           /* physical request pbuf */
    register struct pbuf **qq);         /* Ptr to write request queue anchor */

extern void hd_nearby (
    register struct pbuf *pb,           /* physical request pbuf */
    register struct buf *lb,            /* logical request buf */
    register int mask,                  /* mirrors to avoid */
    register struct volgrp *vg,         /* volume group structure */
    register struct lvol *lv);

extern void hd_parwrite (
    register struct pbuf *pb);          /* physical device buf struct */

/* hd_strat.c */

extern void hd_strategy (
    register struct buf *lb);           /* input list of logical buf structs */

extern void hd_initiate (
    register struct buf *lb);           /* input list of logical bufs */

extern struct buf *hd_reject (
    struct buf *lb,                     /* offending buf structure */
    int errno);                         /* error number */

extern void hd_quiescevg (
    struct volgrp *vg);                 /* pointer from device switch table */

extern void hd_quiet (
    dev_t dev,                          /* number major,minor of LV to quiesce */
    struct volgrp *vg);                 /* ptr from device switch table */

extern void hd_redquiet (
    dev_t dev,                          /* number major,minor of LV */
    struct hd_lvred *red_lst);          /* ptr to list of PPs to remove */

extern int hd_add2pool (
    register struct pbuf *subpool,      /* ptr to pbuf sub pool */
    register struct pbuf *dmpq);        /* ptr to pbuf dump queue */

extern void hd_deallocpbuf (void);

extern int hd_numpbufs (void);

extern void hd_terminate (
    register struct buf *lb);           /* logical buf struct */

extern void hd_unblock (
    register struct buf *next,          /* first request on hash chain */
    register struct buf *lb);           /* logical request to reschedule */

extern void hd_quelb (
    register struct buf *lb,            /* current logical buf struct */
    register struct hd_queue *que);     /* queue structure ptr */

extern int hd_kdis_initmwc (
    struct volgrp *vg);                 /* volume group pointer */

extern int hd_kdis_dswadd (
    register dev_t device,              /* device number of the VG */
    register struct devsw *devsw);      /* address of the devsw entry */

extern int hd_kdis_chgqrm (
    struct volgrp *vg,                  /* volume group pointer */
    short newqrm);                      /* new quorum count */

extern int hd_kproc (void);

/* hd_vgsa.c */

extern int hd_sa_strt (
    register struct pbuf *pb,           /* physical device buf struct */
    register struct volgrp *vg,         /* volgrp pointer */
    register int type);                 /* type of request */

extern void hd_sa_wrt (
    register struct volgrp *vg);        /* volgrp pointer */

extern void hd_sa_iodone (
    register struct buf *lb);           /* ptr to lbuf in VG just completed */

extern void hd_sa_cont (
    register struct volgrp *vg,         /* volgrp pointer */
    register int sa_updated);           /* ptr to lbuf in VG just completed */

extern void hd_sa_hback (
    register struct pbuf *head_ptr,     /* head of pbuf list */
    register struct pbuf *new_pbuf);    /* ptr to pbuf to append to list */

extern void hd_sa_rtn (
    register struct pbuf *head_ptr,     /* head of pbuf list */
    register int err_flg);              /* if true return requests with */
                                        /* ENXIO error */

extern int hd_sa_whladv (
    register struct volgrp *vg,         /* volgrp pointer */
    register int c_whl_idx);            /* current wheel index */

extern void hd_sa_update (
    register struct volgrp *vg);        /* volgrp pointer */

extern int hd_sa_qrmchk (
    register struct volgrp *vg);        /* volgrp pointer */

extern int hd_sa_config (
    register struct volgrp *vg,         /* volgrp pointer */
    register int type,                  /* type of hd_config request */
    register caddr_t arg);              /* ptr to arguments for the request */

extern int hd_sa_onerev (
    register struct volgrp *vg,         /* volgrp pointer */
    register struct pbuf *pv,           /* ptr pbuf structure */
    register int type);                 /* type of hd_config request */

extern void hd_bldpbuf (
    register struct pbuf *pb,           /* ptr to pbuf struct */
    register struct pvol *pvol,         /* target pvol ptr */
    register int type,                  /* type of pbuf to build */
    register caddr_t buffaddr,          /* data buffer address - system */
    register unsigned cnt,              /* length of buffer */
    register struct xmem *xmem,         /* ptr to cross memory descriptor */
    register void (*sched)());          /* ptr to function ret void */

extern int hd_extend (
    struct sa_ext *saext);              /* ptr to structure with extend info */

extern void hd_reduce (
    struct sa_red *sared,               /* ptr to structure with reduce info */
    struct volgrp *vg);                 /* ptr to volume group structure */

/* hd_bbdir.c */

extern void hd_upd_bbdir (
    register struct pbuf *pb);          /* physical request to process */

extern void hd_bbdirend (
    register struct pbuf *vgpb);        /* ptr to VG bb_pbuf */

extern void hd_bbdirop (void);

extern int hd_bbadd (
    register struct pbuf *vgpb);        /* ptr to VG bb_pbuf */

extern int hd_bbdel (
    register struct pbuf *vgpb);        /* ptr to VG bb_pbuf */

extern int hd_bbupd (
    register struct pbuf *vgpb);        /* ptr to VG bb_pbuf */

extern void hd_chk_bbhld (void);

extern void hd_bbdirdone (
    register struct pbuf *origpb);      /* physical request to process */

extern void hd_logerr (
    register unsigned id,               /* original request to process */
    register ulong dev,                 /* device number */
    register ulong arg1,
    register ulong arg2);

#else

/* See above for description of call arguments */

/* hd_mircach.c */

extern int hd_ca_ckcach ( );
extern void hd_ca_use ( );
extern struct ca_mwc_mp *hd_ca_new ( );
extern void hd_ca_wrt ( );
extern void hd_ca_wend ( );
extern void hd_ca_sked ( );
extern struct ca_mwc_mp *hd_ca_fnd ( );
extern void hd_ca_clnup ( );
extern void hd_ca_qunlk ( );
extern int hd_ca_pvque ( );
extern void hd_ca_end ( );
extern void hd_ca_term ( );
extern void hd_ca_mvhld ( );

/* hd_dump.c */

extern int hd_dump ( );
extern int hd_dmpxlate ( );

/* hd_top.c */

extern int hd_open ( );
extern int hd_allocpbuf ( );
extern void hd_pbufdmpq ( );
extern void hd_openbkout ( );
extern void hd_backout ( );
extern int hd_close ( );
extern int hd_vgcleanup ( );
extern void hd_frefrebb ( );
extern int hd_allocpbblk ( );
extern int hd_read ( );
extern int hd_write ( );
extern int hd_mincnt ( );
extern int hd_ioctl ( );
extern struct mwc_rec *hd_alloca ( );
extern void hd_dealloca ( );
extern void hd_nodumpvg ( );



/* hd_phys.c */

extern void hd_begin ( );
extern void hd_end ( );
extern void hd_resume ( );
extern void hd_ready ( );
extern void hd_start ( );
extern void hd_gettime ( );


/* hd_bbrel.c */

extern int hd_chkblk ( );
extern void hd_bbend ( );
extern void hd_baddone ( );
extern void hd_badblk ( );
extern void hd_swreloc ( );
extern daddr_t hd_assignalt ( );
extern struct bad_blk *hd_fndbbrel ( );
extern void hd_nqbblk ( );
extern void hd_dqbblk ( );


/* hd_sched.c */

extern void hd_schedule ( );
extern int hd_avoid ( );
extern void hd_resyncpp ( );
extern void hd_freshpp ( );
extern void hd_mirread ( );
extern void hd_fixup ( );
extern void hd_stalepp ( );
extern void hd_staleppe ( );
extern void hd_xlate ( );
extern int hd_regular ( );
extern void hd_finished ( );
extern int hd_sequential ( );
extern int hd_seqnext ( );
extern void hd_seqwrite ( );
extern int hd_parallel ( );
extern void hd_freeall ( );
/* one further "extern void" declaration is illegible in the source */
extern void hd_nearby ( );
extern void hd_parwrite ( );
/* hd_strat.c */

extern void hd_strategy ( );
extern void hd_initiate ( );
extern struct buf *hd_reject ( );
extern void hd_quiescevg ( );
extern void hd_quiet ( );
extern void hd_redquiet ( );
extern int hd_add2pool ( );
extern void hd_deallocpbuf ( );
extern int hd_numpbufs ( );
extern void hd_terminate ( );
extern void hd_unblock ( );
extern void hd_quelb ( );
extern void hd_kdis_dswadd ( );
extern void hd_kdis_initmwc ( );
extern int hd_kdis_chgqrm ( );
extern int hd_kproc ( );



/* hd_vgsa.c */

extern int hd_sa_strt ( );
extern void hd_sa_wrt ( );
extern void hd_sa_iodone ( );
extern void hd_sa_cont ( );
extern void hd_sa_hback ( );
extern void hd_sa_rtn ( );
extern int hd_sa_whladv ( );
extern void hd_sa_update ( );
extern int hd_sa_qrmchk ( );
extern int hd_sa_config ( );
extern void hd_bldpbuf ( );
extern int hd_sa_onerev ( );
extern void hd_reduce ( );
extern void hd_extend ( );

/* hd_bbdir.c */

extern void hd_upd_bbdir ( );
extern void hd_bbdirend ( );
extern void hd_bbdirop ( );
extern int hd_bbadd ( );
extern int hd_bbdel ( );
extern int hd_bbupd ( );
extern void hd_chk_bbhld ( );
extern void hd_bbdirdone ( );
extern void hd_logerr ( );

#endif /* _NO_PROTO */

#endif /* _KERNEL */

#endif /* _H_HD */
Subject: LVM code

static char sccsid[] = "@(#)hd_vgsa.c 1.4 com/sysx/lvm,3.1.1 10/11/90 18:59:17";

/*
 * COMPONENT_NAME: (SYSXLVM) Logical Volume Manager Device Driver - hd_vgsa.c
 *
 * FUNCTIONS: hd_sa_strt, hd_sa_wrt, hd_sa_iodone, hd_sa_cont, hd_sa_hback,
 *            hd_sa_rtn, hd_sa_whladv, hd_sa_update, hd_sa_qrmchk,
 *            hd_sa_config, hd_bldpbuf, hd_sa_onerev, hd_reduce, hd_extend
 *
 * (C) COPYRIGHT International Business Machines Corp. 1989, 1990
 * All Rights Reserved
 */

/*
 * hd_vgsa.c -- LVM device driver Volume Group Status Area support routines
 *
 * Function:
 *
 *     These routines handle the Volume Group Status Area (VGSA) used
 *     to maintain the state of physical partitions that are copies of each
 *     other. The VGSA also indicates whether a physical volume is missing.
 *
 * Execution environment:
 *
 *     All these routines run on interrupt levels, so they are not
 *     permitted to page fault. They run within critical sections
 *     that are serialized with block I/O offlevel iodone( ) processing.
 */

#include <sys/types.h>
#include <sys/errno.h>
#include <sys/intr.h>
#include <sys/malloc.h>
#include <sys/sleep.h>
#include <sys/hd_psn.h>
#include <sys/dasd.h>
#include <sys/vgsa.h>
#include <sys/hd_config.h>
#include <sys/trchkid.h>
#include <sys/hd.h>

/*
 * NAME: hd_sa_strt
 *
 * FUNCTION: Process a new SA request. Put the request on the hold list
 *           (sa_hld_lst). If the wheel is not rolling start it.
 *
 * NOTES:
 *
 * PARAMETERS:
 *
 * DATASTRUCTS:
 *
 * RETURN VALUE: SUCCESS or FAILURE
 */
int
hd_sa_strt(
register struct pbuf *pb,    /* physical device buf struct */
register struct volgrp *vg,  /* volgrp pointer */
register int type)           /* type of request */
{
    register struct pbuf *hlst; /* temporary sa_hld_lst ptr */
    register int rc;            /* general function return code */

    /*
     * if the VG is closing dont start anything
     */
    if( vg->flags & VG_FORCEDOFF )
        return( FAILURE );

    /*
     * If "pb" is NULL then this is a restart from the config routines.
     * The config routines got control of the WHEEL but then found they
     * did not change anything so they just want to restart it.
     */
    if( pb ) {

        /*
         * Save the type of the request and hang it on the hold list
         */
        pb->pb_type = type;
        pb->pb.av_forw = NULL;
        if( vg->sa_hld_lst ) {

            /*
             * Find end of list
             */
            hlst = vg->sa_hld_lst;
            while( hlst->pb.av_forw )
                hlst = (struct pbuf *)(hlst->pb.av_forw);
            hlst->pb.av_forw = (struct buf *)pb;
        }
        else
            vg->sa_hld_lst = pb;
    }

    /*
     * start the wheel if not rolling already
     */
    if( !(vg->flags & (SA_WHL_ACT | SA_WHL_HLD)) ) {
        vg->flags |= SA_WHL_ACT;

        /*
         * Generate a cross memory descriptor - see hd_sa_wrt( )
         * for reason why it is done here.
         */
        vg->sa_lbuf.b_xmemd.aspace_id = XMEM_INVAL;
        rc = xmattach( vg->vgsa_ptr, sizeof( struct vgsa_area ),
                       &(vg->sa_lbuf.b_xmemd), SYS_ADSPACE );
        ASSERT( rc == XMEM_SUCC );

        hd_sa_cont( vg, 0 );
    }

    return( SUCCESS );
}
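
The hold-list handling in hd_sa_strt can be illustrated outside the kernel. The following is a minimal sketch, not the driver code: struct req and hold_append are hypothetical names, with next standing in for pb.av_forw and type for pb_type.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct pbuf: a forward link and a type tag. */
struct req {
    struct req *next;   /* plays the role of pb.av_forw */
    int type;           /* plays the role of pb_type */
};

/* Tag the request and append it to the tail of the list anchored at *head. */
void hold_append(struct req **head, struct req *r, int type)
{
    struct req *tail;

    assert(head != NULL && r != NULL);
    r->type = type;
    r->next = NULL;
    if (*head == NULL) {        /* empty list: new request becomes the head */
        *head = r;
        return;
    }
    for (tail = *head; tail->next != NULL; tail = tail->next)
        ;                       /* walk to the current tail */
    tail->next = r;
}
```

Appending at the tail keeps requests in arrival order, which is the order in which the wheel later moves them to the active list.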

/*
 * NAME: hd_sa_wrt
 *
 * FUNCTION: Build a buf structure to do logical IO to write the next
 *           SA on the wheel.
 *
 * NOTES:
 *
 * PARAMETERS:
 *
 * DATASTRUCTS:
 *
 * RETURN VALUE: none
 */
void
hd_sa_wrt(
register struct volgrp *vg)  /* volgrp pointer */
{
    register struct buf *lb;    /* ptr to lbuf in volgrp struct */
    register int widx;          /* VG wheel index */
    register int rc;            /* function return code */
    struct xmem xmemd;          /* area to save the xmem descriptor */

    widx = vg->wheel_idx;

    /*
     * Save the cross memory descriptor then zero the buf structure
     * then stuff it with the necessary fields.
     * Saving the cross memory descriptor is faster than attaching/
     * detaching on each PV write. This way we can attach when
     * the wheel is started and not detach until it stops.
     */
    lb = &(vg->sa_lbuf);
    xmemd = lb->b_xmemd;
    bzero( lb, sizeof(struct buf) );

    lb->b_flags = B_BUSY;
    lb->b_iodone = hd_sa_iodone;
    lb->b_dev = makedev( vg->major_num, 0 );
    lb->b_blkno = GETSA_LSN( vg, widx );
    lb->b_baddr = (caddr_t)(vg->vgsa_ptr);
    lb->b_bcount = sizeof( struct vgsa_area );
    lb->b_options = REQ_VGSA;
    lb->b_event = EVENT_NULL;

    /* restore the cross memory descriptor */
    lb->b_xmemd = xmemd;

    /*
     * Save the wheel sequence number that is being written to this
     * VGSA
     */
    SETSA_SEQ( vg, widx, vg->whl_seq_num );

    /*
     * Call hd_regular( ) to translate the logical request then hd_start( )
     * to issue it to the disk drivers.
     * NOTE: hd_regular( ) will use the embedded pbuf in the volgrp
     *       structure, therefore it will never fail due to no
     *       pbufs available. This also means that LV0 does not
     *       have to be open!
     */
    hd_regular( lb, vg );
    hd_start( );

    return;
}
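
The save, zero, restore pattern that hd_sa_wrt applies to the cross memory descriptor can be sketched in portable C. This is an illustration under assumed types: struct desc, struct iobuf, and iobuf_reset are hypothetical and only mimic the shape of struct xmem and struct buf.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical descriptor that is expensive to rebuild on every write. */
struct desc {
    int aspace_id;
    long handle;
};

/* Hypothetical I/O buffer header holding that descriptor. */
struct iobuf {
    int flags;
    long blkno;
    struct desc xmemd;
};

/* Clear every field of the header except the saved descriptor. */
void iobuf_reset(struct iobuf *b, long blkno)
{
    struct desc saved;

    assert(b != NULL);
    saved = b->xmemd;               /* save before zeroing */
    memset(b, 0, sizeof(*b));       /* wipe stale state from the last write */
    b->blkno = blkno;               /* fill in the fields for this write */
    b->xmemd = saved;               /* restore the descriptor */
}
```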

/*
 * NAME: hd_sa_iodone
 *
 * FUNCTION: Return point for end of VGSA write operation.
 *
 * NOTES: Process any error on the write. This means marking the
 *        PV as missing. Then call hd_sa_cont( ) to start the next
 *        SA write if more to do.
 *
 *        If a PV is marked as missing there is no pbuf needed to
 *        remember when this happened. BECAUSE, there is no
 *        specific request waiting on any one particular SA write
 *        request. THEREFORE, the only thing that must be done
 *        is to ensure the wheel keeps rolling for at least one more
 *        revolution from this point. This is done by bumping the
 *        whl_seq_num variable.
 *
 * PARAMETERS:
 *
 * DATASTRUCTS:
 *
 * RETURN VALUE: none
 */
void
hd_sa_iodone(
register struct buf *lb)     /* ptr to lbuf in VG just completed */
{
    register int sa_updated = 0; /* nonzero indicates SA updated */
    struct volgrp *vg;           /* VG volgrp ptr from devsw table */

    /* get the volgrp ptr from device switch table */
    (void) devswqry( lb->b_dev, NULL, &vg );

    lb->b_flags &= ~B_BUSY;

    /*
     * If error on write mark the PV missing
     */
    if( lb->b_flags & B_ERROR ) {

        /*
         * Change pvstate to missing. Set pvmissing flag in VGSA. Check
         * for quorum. Change VGSA timestamp and sequence number.
         * Log an error message concerning the missing PV.
         */
        register struct pvol *pvol; /* ptr to pvol of missing pv */

        pvol = vg->sa_pbuf.pb_pvol;

        pvol->pvstate = PV_MISSING;
        SETSA_PVMISS( vg->vgsa_ptr, pvol->pvnum );
        (void) hd_sa_qrmchk( vg );
        sa_updated = 1;
        hd_sa_update( vg );

        /*
         * error message(????);
         */
    }

    /*
     * Continue to next VGSA write
     */
    hd_sa_cont( vg, sa_updated );
    return;
}

/*
 * NAME: hd_sa_cont
 *
 * FUNCTION: Continue writing VGSA areas
 *
 * NOTES: This function is used to start the wheel or keep it
 *        rolling. The only thing that stops the wheel once
 *        it is rolling is the whl_seq_num variables. When the
 *        last write sa_seq_num matches the next one we are
 *        complete.
 *
 *        If the VG is closing due to a loss of quorum then all
 *        active requests are returned with errors. This will
 *        result in an error being returned with the original
 *        request. Because of the loss of quorum we can not
 *        guarantee the VGSA was updated with the correct information.
 *        Any user data will be recovered by the MWC cache.
 *
 * PARAMETERS:
 *
 * DATASTRUCTS:
 *
 * RETURN VALUE: none
 */
void
hd_sa_cont(
register struct volgrp *vg,  /* volgrp pointer */
register int sa_updated)     /* nonzero indicates SA already updated */
{
    register struct pbuf *hld_req;        /* ptr to request being moved to */
                                          /* active list */
    register struct pbuf *alst;           /* temp sa_act_lst ptr */
    register struct buf **alst_forw;      /* address of sa_act_lst av_forw ptr */
    register struct pbuf *new_req = NULL; /* ptr to the first new request */
                                          /* that was put on the active list */
    register struct buf *alst_lb;         /* ptr to lbuf for active list pbuf */
    register struct buf *hld_lb;          /* ptr to lbuf for hold request pbuf */
    register int n_whl_idx;               /* new wheel index */
    register int i;                       /* general counter */

    /*
     * Put the wheel on hold if some config process wants control of it and
     * that process is not waiting for the wheel to stop. Then
     * wake that process up. Said process will restart the wheel when
     * it is finished making its changes.
     * *NOTE* It is assumed the process has everything it needs in memory
     *        and it is all pinned.
     */
    if( (vg->config_wait != EVENT_NULL) && !(vg->flags & SA_WHL_WAIT) ) {

        vg->flags |= SA_WHL_HLD;
        vg->flags &= ~SA_WHL_ACT;
        xmdetach( &(vg->sa_lbuf.b_xmemd) );
        e_wakeup( &(vg->config_wait) );
        return;
    }
    /*
     * Move any requests currently on the hold list to the active list
     */
    while( vg->sa_hld_lst ) {

        /* Get pbuf at head of list */
        hld_req = vg->sa_hld_lst;
        vg->sa_hld_lst = (struct pbuf *)(hld_req->pb.av_forw);
        hld_req->pb.av_forw = NULL;
        hld_req->pb.av_back = NULL;

        /*
         * Scan active list for any request that is doing the same
         * type of request on the same PPs/PVs. If one is found
         * then hang this request on the av_back list. Thus, this
         * request will be allowed to continue when the head of the
         * av_back list is allowed.
         */
        alst = vg->sa_act_lst;
        alst_forw = (struct buf **)(&(vg->sa_act_lst));

        /*
         * Scan the active list until the end or we find a match
         */
        while( hld_req && alst ) {
            if( alst->pb_type != hld_req->pb_type ) {
                alst_forw = (struct buf **)(&(alst->pb.av_forw));
                alst = (struct pbuf *)(alst->pb.av_forw);
                continue;
            }

            switch( alst->pb_type ) {

            case SA_PVMISSING:
            case SA_PVREMOVED:

                /*
                 * Check the pvol addresses in the pbufs
                 */
                if( alst->pb_pvol == hld_req->pb_pvol ) {

                    /*
                     * We have a match. Hang the new request
                     * on the av_back list.
                     */
                    hd_sa_hback( alst, hld_req );
                    hld_req = NULL;
                }
                break;

            case SA_STALEPP:

                /*
                 * Check that the device number(b_dev) are the same
                 * in the corresponding lbufs. Then that the LPs
                 * are the same. And finally the actual mirrors.
                 */
                alst_lb = alst->pb_lbuf;
                hld_lb = hld_req->pb_lbuf;

                if( (alst_lb->b_dev == hld_lb->b_dev) &&
                    (BLK2PART(vg->partshift, alst_lb->b_blkno) ==
                     BLK2PART(vg->partshift, hld_lb->b_blkno)) ) {

                    /*
                     * Check mirrors - if a mirror is stale on the
                     * active list pbuf but not in the new request
                     * pbuf count it as a match. If the bits are
                     * reversed the new request must be put on the
                     * active list (av_forw) since it must wait
                     * for the PP to be marked as stale.
                     */
                    for( i=0; i < MAXNUMPARTS; i++ ) {
                        if( (alst->pb_mirbad & (1 << i)) ||
                            (hld_req->pb_mirbad & (1 << i)) ) {

                            if( !(alst->pb_mirbad & (1 << i)) ) {
                                break;
                            }
                        }
                    }
                    if( i == MAXNUMPARTS ) {

                        /*
                         * We have a match. Hang the new request
                         * on the av_back list.
                         */
                        hd_sa_hback( alst, hld_req );
                        hld_req = NULL;
                    }
                }
                break;

            case SA_FRESHPP:
            case SA_CONFIGOP:

                /*
                 * Since there can only be one resync operation per
                 * LP all fresh PP operations must be unique.
                 * Therefore, we can go directly to the end of
                 * the active list.
                 * The same thing holds true for config operations.
                 * There can only be one active in the VG at a time.
                 */
                break;

            default:
                panic("hd_sa_cont: unknown pbuf type");
            } /* END switch on pb_type */

            /*
             * If the new request pointer is NULL then the request was
             * put on the av_back list and we can carry on. Otherwise,
             * we must look further down the av_forw list.
             */
            if( hld_req ) {
                alst_forw = (struct buf **)(&(alst->pb.av_forw));
                alst = (struct pbuf *)(alst->pb.av_forw);
            }
        } /* END while( hld_req && alst ) */

        /*
         * If alst is NULL we are at the end of the active list.
         * Put the new request on the list and modify the VGSA as per
         * the type of request.
         */
        if( !alst ) {
            *alst_forw = (struct buf *)hld_req;

            /*
             * If the timestamp on the memory version of the VGSA has
             * not been bumped do it now. Then remember the address of
             * this first pbuf to be added to active list this pass.
             */
            if( !sa_updated ) {
                sa_updated = 1;
                hd_sa_update( vg );
            }
            if( !new_req )
                new_req = hld_req;

            switch( hld_req->pb_type ) {

                register struct lvol *lv;   /* ptr to lvol structure */
                register struct part *part; /* ptr to PP part structure */
                register ulong lp;          /* request LP number */
                register ulong pp;          /* mirror PP number */
                register int mirrors;       /* mirror mask for action */
                register int i;             /* general */

            case SA_PVMISSING:
            case SA_PVREMOVED:






                /*
                 * Change pvstate to missing. Set pvmissing flag in
                 * VGSA. (If removed PV, update the VG's quorum count
                 * before it is rechecked.) Check the quorum.
                 * Log an error message concerning the missing/removed PV.
                 */
                hld_req->pb_pvol->pvstate = PV_MISSING;
                SETSA_PVMISS( vg->vgsa_ptr, hld_req->pb_pvol->pvnum );
                if( hld_req->pb_type == SA_PVREMOVED )
                    vg->quorum_cnt = hld_req->pb.b_work;
                (void) hd_sa_qrmchk( vg );

                /*
                 * error message(????)
                 */
                break;

            case SA_STALEPP:
            case SA_FRESHPP:

                /*
                 * For SA_STALEPP the pb_mirbad field in the pbuf
                 * indicates which mirrors should be marked as
                 * stale. For SA_FRESHPP the pb_mirdone field in
                 * the pbuf indicates which mirrors should be made
                 * fresh(active).
                 * Find the LV lvol structure and LP number of the
                 * logical request.
                 */
                if( hld_req->pb_type == SA_STALEPP )
                    mirrors = hld_req->pb_mirbad;
                else
                    mirrors = hld_req->pb_mirdone;
                hld_lb = hld_req->pb_lbuf;
                lv = VG_DEV2LV( vg, hld_lb->b_dev );
                lp = BLK2PART( vg->partshift, hld_lb->b_blkno );

                /*
                 * Now scan the mirrors bits and for each one that
                 * is set log an error message concerning the
                 * operation then set/reset corresponding
                 * bit in the in memory version of the VGSA.
                 */
                while( mirrors ) {
                    i = FIRST_MASK( mirrors );
                    mirrors &= (~(MIRROR_MASK(i)));
                    part = PARTITION( lv, lp, i );
                    pp = BLK2PART( vg->partshift,
                            part->start - part->pvol->fst_usr_blk );
                    if( hld_req->pb_type == SA_STALEPP ) {

                        /*
                         * error message(????) STALE pp + 1
                         */
                        SETSA_STLPP( vg->vgsa_ptr, part->pvol->pvnum, pp );
                    }
                    else {

                        /*
                         * error message(????) FRESH pp + 1
                         */
                        CLRSA_STLPP( vg->vgsa_ptr, part->pvol->pvnum, pp );
                    }
                }
                break;

            case SA_CONFIGOP:

                /*
                 * No action needed on a hd_config routine request;
                 * the in memory version was modified when the wheel
                 * was put on hold and control passed to the config
                 * routines.
                 */
                break;

            default:
                panic("hd_sa_cont: unknown pbuf type");

            } /* END of switch on pb_type */
        } /* END of if( !alst ) */
    } /* END while( sa_hld_lst ) */

    /*
     * At this point everything is on the active list and the appropriate
     * action taken. If we have lost a quorum due to said action then
     * return all requests on the active lists with errors(ENXIO) if
     * they do not currently have an error indicated. Before getting out
     * clear the active and hold flags and detach the VGSA memory area.
     */
    if( vg->flags & VG_FORCEDOFF ) {
        while( vg->sa_act_lst ) {
            /* Get pbuf at head of list */
            alst = vg->sa_act_lst;
            vg->sa_act_lst = (struct pbuf *)(alst->pb.av_forw);
            hd_sa_rtn( alst, RTN_ERR );
        }
        vg->flags &= (~(SA_WHL_ACT | SA_WHL_HLD));
        xmdetach( &(vg->sa_lbuf.b_xmemd) );

        /*
         * If the wait flag is on then a config function is waiting for
         * the wheel to stop. So, inform that function that it has. This
         * is used so the varyoffvg function will wait, if the wheel is
         * rolling, before removing the data structures.
         */
        if( vg->flags & SA_WHL_WAIT ) {
            vg->flags &= ~SA_WHL_WAIT;
            e_wakeup( &(vg->config_wait) );
        }

        return;
    } /* END if( VG closing ) */
    /*
     * Now see if any request should get off of the wheel. This algorithm
     * assumes that a VGSA can not be removed from the wheel if anyone is
     * using it as a stopping point.
     */
    n_whl_idx = hd_sa_whladv( vg, vg->wheel_idx );
    while( (vg->sa_act_lst) && (new_req != vg->sa_act_lst) ) {
        if( vg->sa_act_lst->pb_whl_stop == n_whl_idx ) {

            /*
             * Time for this request to get off the wheel and continue
             */
            alst = vg->sa_act_lst;
            vg->sa_act_lst = (struct pbuf *)(alst->pb.av_forw);
            hd_sa_rtn( alst, RTN_NORM );
        }
        else if( (vg->pvols[n_whl_idx >> 1]->pvstate == PV_MISSING) ||
                 (NUKESA( vg, n_whl_idx ) == TRUE) ) {

            /*
             * If the next wheel index is on a missing PV, i.e. an
             * inactive VGSA, advance to the next wheel index and see
             * if any request should get off at it. Also, if the SA
             * is to be removed(nuked) then do it now.
             *
             * *NOTE* We should never get here if we lose a quorum. As
             *        a safety measure the assert is in place to prevent
             *        an infinite loop. If we go completely around the
             *        wheel without finding an active VGSA we have a
             *        problem somewhere.
             */
            if( NUKESA( vg, n_whl_idx ) == TRUE ) {
                SETSA_LSN( vg, n_whl_idx, 0 );
                SET_NUKESA( vg, n_whl_idx, FALSE );
            }
            n_whl_idx = hd_sa_whladv( vg, n_whl_idx );
            assert( n_whl_idx != vg->wheel_idx );
        }
        else
            break;
    }

    /*
     * We got out of the last loop under 1 of 3 conditions
     *
     * 1. The active list was empty.
     * 2. The head of active list points to a request that was just added.
     * 3. The head of active list has a stopping point further around
     *    the wheel and we are at the next active VGSA to write.
     *
     * At this point we must make sure the next wheel index(n_whl_idx) is
     * pointing at an active VGSA, i.e. we came out because of condition
     * 1 or 2. If the VGSA is inactive advance to the next active one
     * and then set the pb_whl_stop fields of any new requests. Thus, no
     * request gets put on the wheel at an inactive VGSA.
     */
    while( (vg->pvols[n_whl_idx >> 1]->pvstate == PV_MISSING) ||
           (NUKESA( vg, n_whl_idx ) == TRUE) ) {

        /*
         * *NOTE* We should never get here if we lose a quorum. As
         *        a safety measure the assert is in place to prevent
         *        an infinite loop. If we go completely around the
         *        wheel without finding an active VGSA we have a
         *        problem somewhere.
         */
        if( NUKESA( vg, n_whl_idx ) == TRUE ) {
            SETSA_LSN( vg, n_whl_idx, 0 );
            SET_NUKESA( vg, n_whl_idx, FALSE );
        }
        n_whl_idx = hd_sa_whladv( vg, n_whl_idx );
        assert( n_whl_idx != vg->wheel_idx );
    }

    while( new_req ) {
        new_req->pb_whl_stop = n_whl_idx;
        new_req = (struct pbuf *)(new_req->pb.av_forw);
    }

    /* Save the next wheel index */
    vg->wheel_idx = n_whl_idx;



    /*
     * Check to see if the current VGSA sequence number has been written
     * to the next VGSA. If it has not then write it. If it matches
     * then we have written the latest SA to all available VGSAs so
     * stop the wheel.
     */
    if( vg->whl_seq_num != GETSA_SEQ( vg, n_whl_idx ) )
        hd_sa_wrt( vg );
    else {
        vg->flags &= ~SA_WHL_ACT;
        xmdetach( &(vg->sa_lbuf.b_xmemd) );

        /*
         * If the wait flag is on then a config function is waiting for
         * the wheel to stop. So, inform that function that it has. This
         * is used so the varyoffvg function will wait, if the wheel is
         * rolling, before removing the data structures.
         */
        if( vg->flags & SA_WHL_WAIT ) {
            vg->flags &= ~SA_WHL_WAIT;
            e_wakeup( &(vg->config_wait) );
        }
    }

    /*
     * Just in case anything was unblocked or the cache hold queue was
     * moved to the pending_Q
     */
    hd_schedule( );
    return;
}



/*
 * NAME: hd_sa_hback
 *
 * FUNCTION: Hang a pbuf on the end of the given av_back list
 *
 * NOTES: This function is used to find the end of the given pbuf
 *        list via the av_back pointer. Then, link the new pbuf
 *        on to the list there. Assumes the av_back pointer in the
 *        new pbuf is NULL.
 *
 * PARAMETERS:
 *
 * DATASTRUCTS:
 *
 * RETURN VALUE: none
 */
void
hd_sa_hback(
register struct pbuf *head_ptr,  /* head of pbuf list */
register struct pbuf *new_pbuf)  /* ptr to pbuf to append to list */
{
    while( head_ptr->pb.av_back )
        head_ptr = (struct pbuf *)(head_ptr->pb.av_back);

    head_ptr->pb.av_back = (struct buf *)new_pbuf;

    return;
}

/*
 * NAME: hd_sa_rtn
 *
 * FUNCTION: Return the given av_back list of requests to their
 *           respective callers.
 *
 * NOTES:
 *
 * PARAMETERS:
 *
 * DATASTRUCTS:
 *
 * RETURN VALUE: none
 */
void
hd_sa_rtn(
register struct pbuf *head_ptr,  /* head of pbuf list */
register int err_flg)            /* if true return requests with */
                                 /* ENXIO error */
{
    register struct pbuf *lst_ptr;  /* anchor for av_back list */

    while( head_ptr ) {

        /*
         * piggybacked requests are on the av_back chain
         */
        lst_ptr = (struct pbuf *)(head_ptr->pb.av_back);

        /*
         * if the request should be returned with an error but the
         * B_ERROR flag is off TURN IT ON. Dummy up address so it
         * looks like none of the request worked.
         */
        if( (err_flg == RTN_ERR) && (!(head_ptr->pb.b_flags & B_ERROR)) ) {
            head_ptr->pb.b_flags |= B_ERROR;
            head_ptr->pb.b_error = EIO;
            head_ptr->pb_addr = head_ptr->pb_lbuf->b_baddr;
        }

        /* Set the B_DONE flag to indicate the request is done */
        head_ptr->pb.b_flags |= B_DONE;

        /*
         * return the request via wakeup or function call.
         * It is possible for b_event to still be EVENT_NULL because of
         * some error and pb_sched to be NULL. If this condition exists
         * just drop the request and the caller will see it is complete
         * by checking the B_DONE flag.
         */
        if( head_ptr->pb.b_event != EVENT_NULL )
            e_wakeup( &(head_ptr->pb.b_event) );
        else if( head_ptr->pb_sched )
            HD_SCHED( head_ptr );

        /*
         * get the next one off of the list
         */
        head_ptr = lst_ptr;
    } /* END while( head_ptr ) */
    return;
}

/*
 * NAME: hd_sa_whladv
 *
 * FUNCTION: Advance wheel index to next VGSA
 *
 * NOTES: The wheel index has 2 components. A primary/secondary
 *        bit, the low order bit of the index. This controls which
 *        VGSA is being indexed on any particular PV. The second
 *        component is the PV index. It is the remaining bits of
 *        index, it is used as the index into the pvols array in
 *        the volgrp structure. This mechanism assumes that the
 *        maximum number of PVs in a VG is a power of 2.
 *
 *        If MAXPVS is a power of 2 this function will be much
 *        more efficient.
 *
 * PARAMETERS:
 *
 * DATASTRUCTS:
 *
 * RETURN VALUE: next VGSA on the wheel
 */
int
hd_sa_whladv(
register struct volgrp *vg,  /* volgrp pointer */
register int c_whl_idx)      /* current wheel index */
{
    c_whl_idx++;
    while( 1 ) {

        c_whl_idx %= (MAXPVS * 2);

        /*
         * If no pvol pointer then advance index to next PV.
         * If pvol pointer then look to see if there is a logical sector
         * number associated with the index. If so we have found the
         * next VGSA index. If not bump the index and look again.
         */
        if( !(vg->pvols[ c_whl_idx >> 1 ]) ) {

            /*
             * If the index is odd just bump it by 1 to get to next PV.
             * If it is even bump the index by 2 to get to the next PV.
             */
            if( c_whl_idx & 1 )
                c_whl_idx += 1;
            else
                c_whl_idx += 2;
        }
        else if( GETSA_LSN( vg, c_whl_idx ) )
            break;
        else
            c_whl_idx += 1;
    }

    return( c_whl_idx );
}
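
The index arithmetic of hd_sa_whladv (low bit selects the primary or secondary copy, the remaining bits select the PV) can be tried in isolation. The sketch below is illustrative only: struct pv_sketch, MAXPVS_SKETCH, and wheel_advance are hypothetical names, and unlike the listing, which loops until it finds a copy, this version gives up with -1 after one full pass so that it cannot spin on an empty table.

```c
#include <assert.h>
#include <stddef.h>

#define MAXPVS_SKETCH 4     /* hypothetical; the driver uses MAXPVS */

struct pv_sketch {
    long lsn[2];            /* nonzero: primary/secondary copy exists */
};

/* Return the next wheel index that refers to an existing copy, or -1. */
int wheel_advance(struct pv_sketch *pvols[MAXPVS_SKETCH], int idx)
{
    int tries;

    assert(pvols != NULL);
    idx++;
    for (tries = 0; tries <= 2 * MAXPVS_SKETCH; tries++) {
        idx %= MAXPVS_SKETCH * 2;
        if (pvols[idx >> 1] == NULL)
            idx += (idx & 1) ? 1 : 2;   /* jump to first copy of next PV */
        else if (pvols[idx >> 1]->lsn[idx & 1] != 0)
            return idx;                 /* found an existing copy */
        else
            idx += 1;                   /* this copy is not defined */
    }
    return -1;                          /* no copy anywhere on the wheel */
}
```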

/*
 * NAME: hd_sa_update
 *
 * FUNCTION: Update the in memory version of the VGSA timestamps
 *           and sequence number.
 *
 * NOTES:
 *
 * PARAMETERS:
 *
 * DATASTRUCTS:
 *
 * RETURN VALUE: none
 */
void
hd_sa_update(
register struct volgrp *vg)  /* volgrp pointer */
{
    hd_gettime( &(vg->vgsa_ptr->b_tmstamp) );
    vg->vgsa_ptr->e_tmstamp = vg->vgsa_ptr->b_tmstamp;

    /* bump sequence number */
    vg->whl_seq_num++;

    return;
}

/*
 * NAME: hd_sa_qrmchk
 *
 * FUNCTION: Check the VG for a quorum of SAs
 *
 * NOTES: Count the number of active VGSAs. If the count
 *        is less than the threshold(quorum_cnt) set the
 *        VG_FORCEDOFF flag so the VG will unwind and shutdown.
 *
 * PARAMETERS:
 *
 * DATASTRUCTS:
 *
 * RETURN VALUE: count of active VGSAs
 */
int
hd_sa_qrmchk(
register struct volgrp *vg)  /* volgrp pointer */
{
    register int act_cnt;  /* count of active VGSAs */
    register int idx;      /* PV index */

    /*
     * loop thru the pvols array in the volgrp structure
     */
    for( act_cnt=0, idx=0; idx < MAXPVS; idx++ ) {

        if( (vg->pvols[idx]) && (vg->pvols[idx]->pvstate != PV_MISSING) ) {
            if( vg->pvols[idx]->sa_area[0].lsn )
                act_cnt++;
            if( vg->pvols[idx]->sa_area[1].lsn )
                act_cnt++;
        }
    }

    /*
     * If the VG is already closing there is no need to do this all again
     */
    if( !(vg->flags & VG_FORCEDOFF) && (act_cnt < vg->quorum_cnt) ) {
        vg->flags |= VG_FORCEDOFF;

        /*
         * error message(????) Loss of quorum VG is closing
         */
    }

    return( act_cnt );
}
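
The quorum test in hd_sa_qrmchk reduces to counting the status area copies on volumes that are present and comparing the count with a threshold. A minimal sketch, assuming hypothetical types (struct pv_q, enum pv_state) and helper names that are not part of the listing:

```c
#include <assert.h>
#include <stddef.h>

enum pv_state { PV_OK, PV_GONE };   /* hypothetical volume states */

struct pv_q {
    enum pv_state state;
    long lsn[2];                    /* nonzero: copy 0 / copy 1 exists */
};

/* Count the status area copies that live on volumes still present. */
int active_copies(struct pv_q *pvols[], int npvs)
{
    int i, n = 0;

    assert(pvols != NULL);
    for (i = 0; i < npvs; i++) {
        if (pvols[i] != NULL && pvols[i]->state != PV_GONE) {
            n += (pvols[i]->lsn[0] != 0);
            n += (pvols[i]->lsn[1] != 0);
        }
    }
    return n;
}

/* Nonzero when fewer copies remain than the quorum threshold requires. */
int quorum_lost(struct pv_q *pvols[], int npvs, int quorum_cnt)
{
    return active_copies(pvols, npvs) < quorum_cnt;
}
```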



/*
 * NAME: hd_sa_config
 *
 * FUNCTION: Interface for hd_config routines to access the
 *           VGSA wheel.
 *
 * NOTES: Assumes the hd_config routine has the VG lock.
 *        Thus preventing more than one operation at a time.
 *        AND
 *        The arg variable(array) is in memory and PINNED. Since
 *        this routine may be executed during offlevel interrupt
 *        processing it can not page fault or rely on any disk IO.
 *
 *        There are 3 phases to the hd_config routines modifying
 *        the VGSAs
 *
 *        1. Getting control of the wheel if it is rolling.
 *        2. Modifying the in memory VGSA.
 *        3. Restarting the wheel and waiting for one
 *           revolution.
 *
 *        This function takes care of all of these for the caller.
 *
 * PARAMETERS:
 *
 * DATASTRUCTS:
 *
 * RETURN VALUE: SUCCESS or FAILURE
 */
int
hd_sa_config(
register struct volgrp *vg,  /* volgrp pointer */
register int type,           /* type of hd_config request */
register caddr_t arg)        /* ptr to arguments for the request */
{
    register struct pbuf *pb;   /* ptr to a pbuf struct to use */
    register struct pvol *pv;   /* ptr to target pvol struct */
    register struct cnfg_pp_state *ppi;
    register struct cnfg_pv_ins *pv_info;
    register struct cnfg_pv_del *pvdel_info;
    register struct cnfg_pv_vgsa *vgsa_info;
    register struct pvol *pvol;
    register int o_prty = -1;   /* saved interrupt priority */
    register int rc;            /* function return code */
    register int i;             /* general counter */
    register struct sa_ext *saext;  /* arg for HD_KEXTEND */
    register struct part *oldpp;    /* old part structs */
    register struct part *newpp;    /* new part structs */
    register struct sa_red *sared;  /* arg for HD_KREDUCE */
    register int clear_pv;      /* PV missing flags have changed */
    register int rollwheel;     /* indicates we should start wheel */
    register int re_enable;     /* shows a need to re-enable */
    register struct extred_part *pplist;  /* ptr to pps to reduce */
    struct part *oldparts[MAXNUMPARTS];   /* ptrs to old part structs */
    register int ppcnt, cpcnt, ix;        /* for loop indexes */
    register struct part *pp, *ppl, *ppnew; /* ptrs to part structs in reduce */
    register short ppnum;       /* pp number used in reduce */
    register int copy, cpymsk, lpmsk, redpps, stlpps, statechg;
                                /* mask variables */

    /*
     * if the VG is closing return error
     */
    if( vg->flags & VG_FORCEDOFF )
        return( FAILURE );

    pb = (struct pbuf *)xmalloc( sizeof(struct pbuf), HD_ALIGN, pinned_heap );
    if( pb == NULL )
        return( FAILURE );

    rc = SUCCESS;

    o_prty = i_disable( INTIODONE );  /* start critical section */

    /*
     * Do what the caller wants
     */
    switch( type ) {

    case HD_KMISSPV:
    case HD_KREMPV:

        /*
         * Assumes that only one PV at a time can be marked as
         * missing/removed.
         */
        pvdel_info = (struct cnfg_pv_del *) arg;

        /*
         * zero out the DALVs LP on this PV
         */
        bzero( pvdel_info->lp_ptr, pvdel_info->lpsize );

        /*
         * Go build a pbuf to give to the SA write routines. This
         * way they do all quorum checking and clean up.
         * ( If removing a PV, save the new quorum count in pbuf so
         * hd_sa_cont can update the VG's quorum count right before
         * the quorum is rechecked.)
         */
        hd_bldpbuf( pb, (struct pvol *) pvdel_info->pv_ptr, type,
                    NULL, 0, NULL, NULL );
        if( type == HD_KREMPV ) {
            pb->pb.b_work = pvdel_info->qrmcnt;
            rc = hd_sa_strt( pb, vg, SA_PVREMOVED );
        }
        else
            rc = hd_sa_strt( pb, vg, SA_PVMISSING );
        if( rc == FAILURE )
            break;

        /*
         * If the done flag is on at this point the pbuf has been
         * completed and if we sleep the calling process will hang.
         */
        if( !(pb->pb.b_flags & B_DONE) )
            e_sleep( &(pb->pb.b_event), EVENT_SHORT );

        /*
         * if the error flag is set return FAILURE to the caller
         */
        if( pb->pb.b_flags & B_ERROR )
            rc = FAILURE;

        break;
    case HD_KADDPV:

        /*
         * perform miscellaneous tasks that must be done disabled
         */
        pv_info = (struct cnfg_pv_ins *) arg;
        pv = (struct pvol *)(pv_info->pvol);

        if( vg->pvols[pv_info->pv_idx] == NULL )
            /* set pvol structure pointer for add of a new PV */
            vg->pvols[pv_info->pv_idx] = pv;
        else
            /* copy new pvol data for add of a previously missing PV */
            bcopy( (caddr_t)pv, (caddr_t)vg->pvols[pv_info->pv_idx],
                   sizeof(struct pvol) );

        if( vg->open_count != 0 )
            hd_pvs_opn++;  /* bump number of open PVs */

        /*
         * If we're varying on the VG then return,
         * otherwise initialize the VGSA
         * on this new PV via the WHEEL
         */
        if( vg->flags & VG_OPENING )
            break;

        /*
         * Get control of the wheel if it is rolling.
         */
        if( vg->flags & SA_WHL_ACT )
            e_sleep( &(vg->config_wait), EVENT_SHORT );

        if( vg->flags & VG_FORCEDOFF ) {
            rc = FAILURE;
            break;
        }

        /* update VG's quorum count to include this new PV */
        vg->quorum_cnt = pv_info->qrmcnt;

        /*
         * initialize the SA_SEQ_NUM to a value that will
         * make sure the VGSA on this new PV will be written,
         * and then reset the PV missing flag in the memory
         * copy of the VGSA.
         */
        if( pv->sa_area[0].lsn )
            pv->sa_area[0].sa_seq_num = vg->whl_seq_num - 1;
        if( pv->sa_area[1].lsn )
            pv->sa_area[1].sa_seq_num = vg->whl_seq_num - 1;

        CLRSA_PVMISS( vg->vgsa_ptr, pv->pvnum );

        /*
         * Now force the wheel one revolution. Build a pbuf
         * to give to the wheel, reset the SA holding flag,
         * (re)start the wheel, wait for the wake up to signal
         * the wheel has completed the operation, check status.
         */
        rc = hd_sa_onerev( vg, pb, type );
        break;

case HD_KEXTEND: 
case HD_KREDUCE: 

        /*
         * Get control of the wheel if it is rolling.
         */
        if( vg->flags & SA_WHL_ACT )
            e_sleep( &(vg->config_wait), EVENT_SHORT );
        if( vg->flags & VG_FORCEDOFF ) {
            rc = FAILURE;
            break;
        }

        /*
         * Now that the wheel is ours we can do what needs to be
         * done.
         */
        switch( type ) {
        case HD_KEXTEND:

            /*
             * set up a pointer to the arguments passed in and loop
             * through the cnfg_pp_state structures to process the pps
             * until we come to a ppstate that is CNFG_STOP
             */
            saext = (struct sa_ext *) arg;

            for( ppi = saext->vgsa; (ppi->ppstate != CNFG_STOP); ppi++ ) {
                if( (TSTSA_STLPP(vg->vgsa_ptr, ppi->pvnum, ppi->pp) ? STALEPP :
                     FRESHPP) != ppi->ppstate ) {
                    XORSA_STLPP( vg->vgsa_ptr, ppi->pvnum, ppi->pp );
                    rollwheel = TRUE;
                }
            }

            if( rollwheel == TRUE ) {   /* we changed the VGSA */

                /*
                 * force the wheel one revolution. Build a pbuf to give
                 * to the wheel, reset the SA holding flag, (re)start
                 * the wheel, wait for the wake up to signal that the
                 * wheel has completed the operation, check status.
                 */
                hd_bldpbuf( pb, NULL, type, NULL, 0, NULL, NULL );
                vg->flags &= ~SA_WHL_HLD;
                rc = hd_sa_strt( pb, vg, SA_CONFIGOP );
                if( rc == FAILURE )
                    break;

                /*
                 * If the done flag is on at this point the pbuf has
                 * been completed and if we sleep, the calling process
                 * will hang.
                 */
                if( !(pb->pb.b_flags & B_DONE) )
                    e_sleep( &(pb->pb.b_event), EVENT_SHORT );

                /*
                 * If the error flag is set return FAILURE to the
                 * caller.
                 */
                if( pb->pb.b_flags & B_ERROR ) {
                    rc = FAILURE;
                    break;
                }
            } /* end if rollwheel == TRUE */

            /*
             * call hd_extend( ) to check for resync in progress and to
             * transfer the new lv information to the old lv information
             */
            rc = hd_extend( saext );
            break;
        case HD_KREDUCE:

            /* set up the needed pointers and variables */
            sared = (struct sa_red *) arg;
            rollwheel = FALSE;
            pplist = sared->list;
            for( i = 0; i < MAXNUMPARTS; i++ )
                oldparts[i] = sared->lv->parts[i];

            /*
             * for the number of physical partitions being reduced, go through
             * the logical partitions and build masks for the pps being
             * reduced, pps that are stale, and the pps that exist; and,
             * check that there are no resyncs in progress. Once the masks
             * are built go through and check that we aren't reducing the last
             * good copy of the lp. After this, we have finished the validation
             * phase and can then begin the process phase in which we
             * go through and turn on the PP_REDUCING bits and the
             * PP_STALE and PP_CHGING bits in the active pps that are being
             * reduced.
             */

            for( ppcnt = 1; ppcnt <= sared->numred; pplist++, ppcnt++ ) {
                if( pplist->mask != 0 ) {
                    cpymsk = MIRROR_EXIST( sared->lv->nparts );
                    lpmsk = stlpps = 0;
                    redpps = pplist->mask;
                    while( cpymsk != ALL_MIRRORS ) {
                        copy = FIRST_MIRROR( cpymsk );
                        cpymsk |= MIRROR_MASK( copy );
                        pp = PARTITION( sared->lv, (pplist->lp_num - 1), copy );
                        if( pp->pvol ) {
                            if( copy == 0 ) {
                                if( pp->sync_trk != NO_SYNCTRK ) {
                                    rc = FAILURE;
                                    sared->error = CFG_SYNCER;
                                    break;
                                }
                            }
                            lpmsk |= MIRROR_MASK( copy );
                            if( (pp->ppstate & (PP_STALE | PP_CHGING)) == PP_STALE )
                                stlpps |= MIRROR_MASK( copy );
                        } /* end if there is a pvol in this part struct */
                    } /* end while */
                    if( rc == FAILURE )
                        break;

                    /*
                     * if we're not reducing all of the copies of this lp, check
                     * to be sure we're not reducing the last good copy
                     */
                    if( redpps != lpmsk ) {
                        /* if there are no good copies left */
                        if( !((stlpps | redpps) ^ lpmsk) ) {
                            rc = FAILURE;
                            sared->error = CFG_INLPRD;
                            break;
                        }
                    } /* end if redpps != lpmsk */
                } /* end if */
            } /* end for */
            if( rc == FAILURE )
                break;

            /* now that we've validated the data, we can proceed */
            pplist = sared->list;

            for( cpymsk = pplist->mask; cpymsk; cpymsk &= ~MIRROR_MASK(copy) ) {
                copy = FIRST_MASK( cpymsk );
                pp = PARTITION( sared->lv, (pplist->lp_num - 1), copy );
                if( (pp->ppstate & (PP_STALE | PP_CHGING)) == PP_STALE )
                    pp->ppstate |= PP_REDUCING;
                else {
                    pp->ppstate |= (PP_STALE | PP_CHGING | PP_REDUCING);
                    ppnum = BLK2PART( vg->partshift,
                                      (pp->start - pp->pvol->fst_usr_blk) );
                    SETSA_STLPP( vg->vgsa_ptr, pp->pvol->pvnum, ppnum );
                    rollwheel = TRUE;
                }
            } /* end for */

            /* if we changed the VGSA */
            if( rollwheel == TRUE ) {

                /*
                 * force the wheel one revolution. Build
                 * a pbuf to give the wheel, reset the SA
                 * holding flag, (re)start the wheel, wait for
                 * the wake up to signal the wheel has completed
                 * the operation, check status.
                 */
                hd_bldpbuf( pb, NULL, type, NULL, 0, NULL, NULL );
                vg->flags &= ~SA_WHL_HLD;
                rc = hd_sa_strt( pb, vg, SA_CONFIGOP );
                if( rc == FAILURE )
                    break;

                /*
                 * If the done flag is on at this point the
                 * pbuf has been completed and if we sleep the
                 * calling process will hang.
                 */
                if( !(pb->pb.b_flags & B_DONE) )
                    e_sleep( &(pb->pb.b_event), EVENT_SHORT );

                /*
                 * If the error flag is set return FAILURE to
                 * the caller.
                 */
                if( pb->pb.b_flags & B_ERROR ) {
                    rc = FAILURE;
                    break;
                }

                /*
                 * if the logical volume is open, then
                 * drain the logical volume : wait for all requests currently
                 * in the lv work queue to complete
                 */
                if( sared->lv->lv_status == LV_OPEN )
                    hd_quiet( makedev(vg->major_num, sared->min_num), vg );
            } /* end if rollwheel */
            else

                /*
                 * if we didn't change the VGSA, then release the inhibit
                 * on the wheel and restart it if it was rolling when we
                 * started
                 */
                if( vg->flags & SA_WHL_HLD ) {
                    vg->flags &= ~SA_WHL_HLD;
                    rc = hd_sa_strt( NULL, vg, SA_CONFIGOP );
                    if( rc == FAILURE )
                        break;
                }

            /* reset the pplist pointer to the beginning of the list */
            pplist = sared->list;

            /*
             * call hd_reduce( ) to handle promotions and to transfer the
             * new lv information to the old lv information
             */
            hd_reduce( sared, vg );
            break;

        } /* END of switch(type) */
        break;

    case HD_KDELPV:

        pvdel_info = (struct cnfg_pv_del *) arg;
        pv = (struct pvol *)(pvdel_info->pv_ptr);

        /*
         * update the VG quorum count -
         * For a PV to be deleted, NO partitions may be allocated,
         * therefore, we don't have to be as careful here as we
         * are with REMOVEPV when we update the quorum count.
         */
        vg->quorum_cnt = pvdel_info->qrmcnt;

        /*
         * If the wheel is not rolling just remove the pvol pointer
         * from the volgrp structure. If any PV missing flags should
         * be reset then reset them and roll the wheel when ready.
         *
         * If the wheel is rolling things are not so simple. The
         * pvol pointer cannot be jerked out from under the wheel if
         * a request is using it as a stopping point. Therefore,
         * mark the PV missing in the pvol structure, then issue a
         * config request to the wheel forcing the wheel to go
         * one revolution. Since the PV was marked as missing before
         * the config request, it is guaranteed that no request will
         * be using it as a stopping point. It is also guaranteed
         * that the wheel index will not be sitting on any missing PV.
         * Therefore, at this point the pvol pointer can be removed
         * safely.
         */

        if( vg->flags & SA_WHL_ACT ) {
            pv->pvstate = PV_MISSING;

            /*
             * Now force the wheel one revolution. Build a pbuf
             * to give the wheel, reset the SA holding flag,
             * (re)start the wheel, wait for the wake up to signal
             * the wheel has completed the operation, check status.
             */
            /* parenthesized so rc receives the return code, not the comparison */
            if( (rc = hd_sa_onerev(vg, pb, type)) != LVDD_SUCCESS )
                break;

        } /* END of if the wheel active */

        /* zero out the VG's pvol ptr */
        vg->pvols[ pv->pvnum ] = NULL;

        /*
         * Miscellaneous updates that must be made disabled:
         * delete the DALV's LP on this PV, decrement the global
         * PV open count, and update the VG's quorum count
         */
        bzero( pvdel_info->lp_ptr, pvdel_info->lpsize );
        if( vg->open_count != 0 )
            hd_pvs_opn--;

        break;

case HD_KADDVGSA: 
case HD_KDELVGSA: 

        vgsa_info = (struct cnfg_pv_vgsa *) arg;
        pv = vgsa_info->pv_ptr;

        vg->quorum_cnt = vgsa_info->qrmcnt;
        if( type == HD_KADDVGSA ) {

            /*
             * ADDING VGSA(s) to this PV - fill in the VGSA LSNs
             * and change the PV's VGSA sequence number so this
             * PV's VGSAs will be written.
             */
            if( vgsa_info->sa_lsns[0] ) {
                pv->sa_area[0].lsn = vgsa_info->sa_lsns[0];
                pv->sa_area[0].sa_seq_num = vg->whl_seq_num - 1;
            }
            if( vgsa_info->sa_lsns[1] ) {
                pv->sa_area[1].lsn = vgsa_info->sa_lsns[1];
                pv->sa_area[1].sa_seq_num = vg->whl_seq_num - 1;
            }

            /*
             * get control of the wheel and wait for it to run
             * one full revolution
             */
            if( vg->flags & SA_WHL_ACT )
                e_sleep( &(vg->config_wait), EVENT_SHORT );
            if( vg->flags & VG_FORCEDOFF ) {
                rc = FAILURE;
                break;
            }

            rc = hd_sa_onerev( vg, pb, type );
        }
        else {

            /*
             * DELETING VGSA(s) from this PV - if the wheel is active,
             * get control of it, set the flag for the VGSA(s) being
             * deleted, and then wait for the wheel to run one
             * revolution (the LVDD code that runs the wheel will zero
             * out the VGSA LSN when the nukesa flag is set).
             * If the wheel is NOT active, then just zero out the VGSA
             * LSN's now.
             */
            if( vg->flags & SA_WHL_ACT ) {
                e_sleep( &(vg->config_wait), EVENT_SHORT );

                if( vg->flags & VG_FORCEDOFF ) {
                    rc = FAILURE;
                    break;
                }

                if( vgsa_info->sa_lsns[0] )
                    pv->sa_area[0].nukesa = TRUE;
                if( vgsa_info->sa_lsns[1] )
                    pv->sa_area[1].nukesa = TRUE;

                rc = hd_sa_onerev( vg, pb, type );
            }
            else {  /* the wheel is NOT rolling */
                if( vgsa_info->sa_lsns[0] )
                    pv->sa_area[0].lsn = 0;
                if( vgsa_info->sa_lsns[1] )
                    pv->sa_area[1].lsn = 0;
            }
        }

        break;
case HD_MWC_REC: 

        /*
         * Just update the VGSA:
         * get control of the wheel and wait for it to run
         * one full revolution.
         */
        if( vg->flags & SA_WHL_ACT )
            e_sleep( &(vg->config_wait), EVENT_SHORT );
        if( vg->flags & VG_FORCEDOFF ) {
            rc = FAILURE;
            break;
        }
        rc = hd_sa_onerev( vg, pb, type );
        break;

    default:
        panic( "hd_sa_config: unknown request type" );
    } /* END of switch( type ) */

    i_enable( o_prty );     /* return to caller priority */

    /* Give back the memory we borrowed for the pbuf struct */
    assert( xmfree(pb, pinned_heap) == LVDD_SUCCESS );

    return( rc );
}
/*
 * NAME: hd_sa_onerev
 *
 * FUNCTION: Force the WHEEL one revolution to update the VGSA
 *           on all active PVs
 *
 * NOTES:
 *
 * PARAMETERS: vg   - pointer to volume group
 *             pb   - pbuf pointer
 *             type - type of VGSA config operation
 *
 * DATASTRUCTS:
 *
 * RETURN VALUE: FAILURE if the wheel could not be started or the pbuf
 *               completed with its error flag set; otherwise the
 *               return code from hd_sa_strt
 */
int
hd_sa_onerev(
register struct volgrp *vg,     /* ptr to volgrp struct  */
register struct pbuf *pb,       /* ptr to pbuf struct    */
register int type)              /* type of pbuf to build */
{
    register int rc;

    /*
     * Now force the wheel one revolution. Build a pbuf
     * to give the wheel, reset the SA holding flag,
     * (re)start the wheel, wait for the wake up to signal
     * the wheel has completed the operation, check status.
     */
    hd_bldpbuf( pb, NULL, type, NULL, 0, NULL, NULL );
    vg->flags &= ~SA_WHL_HLD;
    rc = hd_sa_strt( pb, vg, SA_CONFIGOP );
    if( rc == FAILURE )
        return( rc );

    /*
     * If the done flag is on at this point the pbuf has been
     * completed and if we sleep the calling process will hang.
     */
    if( !(pb->pb.b_flags & B_DONE) )
        e_sleep( &(pb->pb.b_event), EVENT_SHORT );

    /*
     * If the error flag is set return FAILURE to the caller
     */
    if( pb->pb.b_flags & B_ERROR )
        rc = FAILURE;

    return( rc );
}
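
The listing above repeatedly sets a PV's sa_seq_num to whl_seq_num - 1 so that the next revolution of the wheel rewrites that PV's VGSA. The following is a minimal user-space sketch of that idea only. It assumes that the wheel rewrites any copy whose sequence number differs from the in-memory one, which the comments imply but the listing does not show. The names toy_pv, toy_add_pv, toy_wheel_onerev and toy_wheel_selfcheck are hypothetical and do not appear in the LVDD source.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAXPVS 4

/* Hypothetical stand-in for one physical volume's VGSA bookkeeping. */
struct toy_pv {
    int            present;     /* nonzero if this slot holds a PV      */
    unsigned short sa_seq_num;  /* sequence number last written to disk */
};

/* Mark a newly added PV as out of date so the next revolution rewrites it.
 * The cast makes the wrap from 0 to the largest unsigned short explicit. */
void toy_add_pv(struct toy_pv *pv, unsigned short whl_seq_num)
{
    pv->present = 1;
    pv->sa_seq_num = (unsigned short)(whl_seq_num - 1);
}

/*
 * Visit every PV once ("one revolution") and bring any copy whose
 * sequence number differs from the in-memory one up to date.
 * Returns the number of copies that had to be written.
 */
int toy_wheel_onerev(struct toy_pv pvs[], size_t npvs,
                     unsigned short whl_seq_num)
{
    int written = 0;
    size_t i;

    for (i = 0; i < npvs; i++) {
        if (!pvs[i].present)
            continue;
        if (pvs[i].sa_seq_num != whl_seq_num) {
            pvs[i].sa_seq_num = whl_seq_num;    /* simulated disk write */
            written++;
        }
    }
    return written;
}

/* Small usage check: two current PVs, then one PV added. */
int toy_wheel_selfcheck(void)
{
    struct toy_pv pvs[TOY_MAXPVS] = { {1, 7}, {1, 7}, {0, 0}, {0, 0} };

    toy_add_pv(&pvs[2], 7);
    assert(toy_wheel_onerev(pvs, TOY_MAXPVS, 7) == 1);  /* only the new PV */
    assert(toy_wheel_onerev(pvs, TOY_MAXPVS, 7) == 0);  /* all current now */
    return 0;
}
```

The sketch leaves out everything that makes the real wheel hard: asynchronous I/O, the hold flag, and requests that join the wheel mid-revolution.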
/*
 * NAME: hd_bldpbuf
 *
 * FUNCTION: Initialize a pbuf structure for LVDD disk io.
 *
 * NOTES:
 *
 * PARAMETERS: none
 *
 * DATASTRUCTS:
 *
 * RETURN VALUE: none
 */
void
hd_bldpbuf(
register struct pbuf *pb,       /* ptr to pbuf struct                */
register struct pvol *pvol,     /* target pvol ptr                   */
register int type,              /* type of pbuf to build             */
register caddr_t bufaddr,       /* data buffer address - system      */
register unsigned cnt,          /* length of buffer                  */
register struct xmem *xmem,     /* ptr to cross memory descriptor    */
register void (*sched)())       /* pointer to function returning void */
{
    register struct buf *lb;    /* ptr to buf struct part of pbuf */

    /*
     * Zero the pbuf then stuff it with the necessary fields
     */
    bzero( pb, sizeof(struct pbuf) );

    lb = (struct buf *)pb;
    if( pvol )
        lb->b_dev = pvol->dev;

    lb->b_baddr = bufaddr;
    lb->b_bcount = cnt;
    lb->b_event = EVENT_NULL;
    if( xmem )
        lb->b_xmemd = *xmem;

    pb->pb_sched = sched;
    pb->pb_pvol = pvol;

    switch( type ) {

    /* mirror write consistency cache write type */
    case CATYPE_WRT:
        lb->b_iodone = hd_ca_end;
        lb->b_flags = B_BUSY | B_NOHIDE;
        lb->b_blkno = PSN_MWC_REC0;
        break;

    case HD_MWC_REC:
    case HD_KMISSPV:
    case HD_KREMPV:
    case HD_KREDUCE:
    case HD_KEXTEND:
    case HD_KADDPV:
    case HD_KDELPV:
    case HD_KADDVGSA:
    case HD_KDELVGSA:
        lb->b_iodone = NULL;
        lb->b_flags = B_BUSY;
        break;

    default:
        panic( "hd_vgsa: unknown pbuf type" );
        break;
    }

    return;
}
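
hd_bldpbuf fills in the embedded buf by casting the pbuf pointer to a buf pointer, which works only because the buf is the first member of the pbuf. The sketch below shows that C idiom in isolation. The structures are hypothetical, cut-down stand-ins (toy_buf, toy_pbuf) whose field names merely echo the listing, and memset is used in place of bzero.

```c
#include <assert.h>
#include <string.h>

#define TOY_B_BUSY 0x1

/* Hypothetical, cut-down buf: layout is illustrative only. */
struct toy_buf {
    int  b_flags;
    long b_bcount;
};

/* Hypothetical pbuf: the buf must be the first member so that a pointer
 * to the pbuf and a pointer to its embedded buf are interchangeable. */
struct toy_pbuf {
    struct toy_buf pb;
    int            pb_type;
};

/* Zero the pbuf, then fill the embedded buf through a buf pointer. */
void toy_bldpbuf(struct toy_pbuf *pbp, int type, long cnt)
{
    struct toy_buf *lb;

    memset(pbp, 0, sizeof(*pbp));
    lb = (struct toy_buf *)pbp;     /* same address as &pbp->pb */
    lb->b_bcount = cnt;
    lb->b_flags = TOY_B_BUSY;
    pbp->pb_type = type;
}

/* Recover the enclosing pbuf from a buf pointer, as a completion
 * routine that is handed only the buf would need to do. */
struct toy_pbuf *toy_buf_to_pbuf(struct toy_buf *bp)
{
    return (struct toy_pbuf *)bp;
}

/* Small usage check. */
int toy_pbuf_selfcheck(void)
{
    struct toy_pbuf p;

    toy_bldpbuf(&p, 4, 512);
    assert(p.pb.b_bcount == 512);
    assert(toy_buf_to_pbuf(&p.pb) == &p);
    return 0;
}
```

C guarantees that a pointer to a structure, suitably converted, points to its first member and vice versa, so both casts are well defined as long as the embedded buf stays first.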
/*
 * NAME: hd_extend
 *
 * FUNCTION: Transfers old part struct information to new part struct
 *           information.
 *
 * NOTES:
 *
 * PARAMETERS: saext pointer to information structure for the extend
 *
 * DATASTRUCTS:
 *
 * RETURN VALUE: SUCCESS or FAILURE
 */
int
hd_extend(
struct sa_ext *saext)   /* pointer to extend information structure */
{
    register int lpi, cpi;          /* loop counters */
    register int rc;                /* return code */
    register struct part *oldpp;    /* pointer to old part struct */
    register struct part *newpp;    /* pointer to new part struct */

    /*
     * for the old number of logical partitions on the
     * logical volume, go through and search each possible
     * old copy. If the logical partition is not being
     * resynced, put the old part struct information
     * into the new part struct entry
     */
    rc = SUCCESS;

    for( lpi = 0; lpi < saext->old_numlps; lpi++ ) {
        for( cpi = 0; cpi < saext->old_nparts; cpi++ ) {
            if( saext->klv_ptr->parts[cpi] != NULL ) {
                oldpp = (struct part *)(saext->klv_ptr->parts[cpi] + lpi);
                if( oldpp->pvol != NULL ) {
                    if( cpi == 0 ) {
                        if( oldpp->sync_trk != NO_SYNCTRK ) {
                            saext->error = CFG_SYNCER;
                            rc = FAILURE;
                            break;
                        }
                    }
                    newpp = (struct part *)
                            (*(saext->new_parts + cpi) + lpi);
                    *newpp = *oldpp;
                } /* end if oldpp->pvol != NULL */
            } /* end if klv_ptr->parts != NULL */
        } /* end for number of old copies */
        if( rc == FAILURE )
            break;
    } /* end for old number of lps */

    /*
     * if no errors were found, we can complete the
     * extend by filling in the lvol struct with the
     * new info.
     */
    if( rc == SUCCESS ) {
        saext->klv_ptr->nparts = saext->nparts;
        saext->klv_ptr->nblocks = saext->nblocks;
        saext->klv_ptr->i_sched = saext->isched;
        for( cpi = 0; cpi < saext->nparts; cpi++ )
            saext->klv_ptr->parts[cpi] = saext->new_parts[cpi];
    } /* end if rc == SUCCESS */
    return( rc );
}



/*
 * NAME: hd_reduce



FIG. 21-136 



EP 0482 853 A2 



 * FUNCTION: Transfers old part struct information to new part struct
 *           information, and handles promotion if needed.
 *
 * NOTES:
 *
 * PARAMETERS: sared pointer to information structure for the reduce
 *             vg    pointer to volume group structure
 *
 * DATASTRUCTS:
 *
 * RETURN VALUE: none
 */
void
hd_reduce(
struct sa_red *sared,   /* pointer to information on the reduce */
struct volgrp *vg)      /* pointer to volume group structure */
{
    register int i, ppcnt, lpcnt, cpcnt;
                                /* loop counters */
    register struct part *pp, *op, *np, *sp, *tp;
                                /* part struct pointers */
    register struct extred_part *pplist;
                                /* pointer to array of ppinfo structs */
    register int ppsleft;       /* mask for pps left after reduction */
    register int copy;          /* holds copy of lp we're processing */
    register int redpps, cpymsk; /* masks for the logical partition */
    register int zeromsk;       /* mask for copies to zero out */
    register int size;          /* size of old part structs to copy to new */

    struct part zeropp;         /* zeroed out part struct used to zero parts */

    pplist = sared->list;

    bzero( (char *)(&zeropp), sizeof(struct part) );

    /*
     * go through the pps being reduced and update the old copy as needed.
     * Do the necessary promotions and deletions in the old copy PRIOR to
     * copying things over to the new copy.
     */
    cpymsk = MIRROR_EXIST( sared->lv->nparts );
    for( ppcnt = 1; ppcnt <= sared->numred; pplist++, ppcnt++ ) {
        if( pplist->mask != 0 ) {
            redpps = cpymsk | pplist->mask;

            /*
             * NOTE: redpps is a 3 bit field that can have the values
             * 0 (000) - 7 (111). The zero condition cannot exist on a reduce,
             * however.
             */
            switch( redpps ) {

            /* promote secondary to primary and tertiary to secondary */
            case 1: pp = PARTITION( sared->lv, (pplist->lp_num-1), PRIMMIRROR );
                    sp = PARTITION( sared->lv, (pplist->lp_num-1), SINGMIRROR );
                    tp = PARTITION( sared->lv, (pplist->lp_num-1), DOUBMIRROR );
                    *pp = *sp;
                    /*
                     * set up a mask to show the promoted lp
                     * the bits will be off for good copies and on for
                     * the copies that are now invalid.
                     */
                    *sp = *tp;
                    zeromsk = TERTIARY_MIRROR;
                    break;

            /* promote tertiary to secondary */
            case 2: sp = PARTITION( sared->lv, (pplist->lp_num-1), SINGMIRROR );
                    tp = PARTITION( sared->lv, (pplist->lp_num-1), DOUBMIRROR );
                    *sp = *tp;
                    zeromsk = TERTIARY_MIRROR;
                    break;

            /* promote tertiary to primary */
            case 3: pp = PARTITION( sared->lv, (pplist->lp_num-1), PRIMMIRROR );
                    tp = PARTITION( sared->lv, (pplist->lp_num-1), DOUBMIRROR );
                    *pp = *tp;
                    zeromsk = (TERTIARY_MIRROR | SECONDARY_MIRROR);
                    break;

            /* no promotion */
            case 4:
            case 6:
            case 7: zeromsk = redpps;
                    break;

            /* promote secondary to primary */
            case 5: pp = PARTITION( sared->lv, (pplist->lp_num-1), PRIMMIRROR );
                    sp = PARTITION( sared->lv, (pplist->lp_num-1), SINGMIRROR );
                    *pp = *sp;
                    zeromsk = (TERTIARY_MIRROR | SECONDARY_MIRROR);
                    break;
            } /* end switch */

            /* set up a mask of copies to zero out */
            zeromsk &= ~cpymsk;

            /* zero out the necessary copies of the logical partition */
            while( zeromsk != 0 ) {
                copy = FIRST_MASK( zeromsk );
                pp = PARTITION( sared->lv, (pplist->lp_num-1), copy );
                *pp = zeropp;
                zeromsk &= ~MIRROR_MASK( copy );
            }

        } /* end if */
    } /* end for ppcnt */

    /* go through and transfer each copy to the new part structure */
    for( cpcnt = 0; cpcnt < sared->nparts; cpcnt++ ) {
        size = sared->numlps * sizeof(struct part);
        bcopy( sared->lv->parts[cpcnt], sared->newparts[cpcnt], size );
        sared->lv->parts[cpcnt] = sared->newparts[cpcnt];
    }

    /* NULL out the pointers to the copies that no longer exist */
    for( i = sared->nparts; i < sared->lv->nparts; i++ )
        sared->lv->parts[i] = NULL;

    /*
     * reset the lvol structure with the values in the extred
     * structure and loop through to put the newparts pointers
     * into the lvol parts field
     */
    sared->lv->nparts = sared->nparts;
    sared->lv->nblocks = PART2BLK( vg->partshift, sared->numlps );
    sared->lv->i_sched = sared->isched;

    return;
}
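
The promotion switch in hd_reduce can be hard to follow from the bit values alone. The sketch below replays the same case table on a plain three-element array so that the effect of each mask value is visible. It is an illustration under assumptions, not the LVDD code: the names toy_promote, toy_promote_selfcheck, TOY_PRIMARY, TOY_SECONDARY and TOY_TERTIARY are hypothetical, a copy is represented by a nonzero integer id, and the step that masks off copies that never existed (zeromsk &= ~cpymsk) is omitted because zeroing an already empty slot is harmless here.

```c
#include <assert.h>

#define TOY_PRIMARY   0x1   /* bit 0: primary copy   */
#define TOY_SECONDARY 0x2   /* bit 1: secondary copy */
#define TOY_TERTIARY  0x4   /* bit 2: tertiary copy  */

/*
 * copies[0..2] hold a nonzero id for each copy of one logical partition,
 * or 0 for "no copy". gone is a 3-bit mask in which a set bit means that
 * copy is being removed or never existed. Surviving copies are moved
 * toward the primary slot and the vacated slots are zeroed.
 */
void toy_promote(int copies[3], unsigned gone)
{
    unsigned zeromsk;

    switch (gone & 0x7) {
    case 1:     /* secondary -> primary, tertiary -> secondary */
        copies[0] = copies[1];
        copies[1] = copies[2];
        zeromsk = TOY_TERTIARY;
        break;
    case 2:     /* tertiary -> secondary */
        copies[1] = copies[2];
        zeromsk = TOY_TERTIARY;
        break;
    case 3:     /* tertiary -> primary */
        copies[0] = copies[2];
        zeromsk = TOY_TERTIARY | TOY_SECONDARY;
        break;
    case 5:     /* secondary -> primary */
        copies[0] = copies[1];
        zeromsk = TOY_TERTIARY | TOY_SECONDARY;
        break;
    default:    /* 0, 4, 6, 7: nothing to promote */
        zeromsk = gone & 0x7;
        break;
    }

    if (zeromsk & TOY_PRIMARY)
        copies[0] = 0;
    if (zeromsk & TOY_SECONDARY)
        copies[1] = 0;
    if (zeromsk & TOY_TERTIARY)
        copies[2] = 0;
}

/* Small usage check: removing the primary of a three-copy partition. */
int toy_promote_selfcheck(void)
{
    int copies[3] = { 10, 20, 30 };

    toy_promote(copies, TOY_PRIMARY);
    assert(copies[0] == 20 && copies[1] == 30 && copies[2] == 0);
    return 0;
}
```

In every case the surviving copies end up packed into the lowest slots, which is why the real routine can then copy the first nparts arrays and null the rest.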



/* DASD.H */

#ifndef _H_DASD
#define _H_DASD

/*
 * COMPONENT_NAME: (SYSXLVM) Logical Volume Manager - dasd.h
 *
 * © COPYRIGHT International Business Machines Corp. 1988, 1990
 * All Rights Reserved
 */

/*
 * Logical Volume Manager Device Driver data structures.
 */

#include <sys/types.h>
#include <sys/sleep.h>
#include <sys/lockl.h>
#include <sys/sysmacros.h>
#include <sys/buf.h>
#include <sys/lvdd.h>

/* FIFO queue structure for scheduling logical requests. */
struct hd_queue {               /* queue header structure */
    struct buf *head;           /* oldest request in the queue */
    struct buf *tail;           /* newest request in the queue */
};

struct hd_capvq {               /* queue header structure */
    struct pv_wait *head;       /* oldest request in the queue */
    struct pv_wait *tail;       /* newest request in the queue */
};

/*
 * structure used by hd_redquiet( ) to mark target PPs for removal.
 * Both are zero relative.
 */
struct hd_lvred {
    long lp;                    /* LP the pp belongs to */
    char mirror;                /* mirror number of PP */
};
/*
 * Physical request buf structure.
 *
 * A 'pbuf' is a 'buf' structure with some additional fields used
 * to track the status of the physical requests that correspond to
 * each logical request. A pool of pinned pbufs is allocated and
 * managed by the device driver. The size of this pool depends on
 * the number of open logical volumes.
 */
struct pbuf {

    /* this must come first, 'buf' pointers can be cast to 'pbuf' */
    struct buf pb;              /* imbedded buf for physical driver */

    /* physical buf structure appendage: */
    struct buf *pb_lbuf;        /* corresponding logical buf struct */

    /* scheduler I/O done policy function */
#ifndef _NO_PROTO
    void (*pb_sched) (struct pbuf *);
#else
    void (*pb_sched) ();
#endif

    struct pvol *pb_pvol;       /* physical volume structure */
    struct bad_blk *pb_bad;     /* defects directory entry */
    daddr_t pb_start;           /* starting physical address */
    char pb_mirror;             /* current mirror */
    char pb_miravoid;           /* mirror avoidance mask */
    char pb_mirbad;             /* mask of broken mirrors */
    char pb_mirdone;            /* mask of mirrors done */
    char pb_swretry;            /* number of sw relocation retries */
    char pb_type;               /* Type of pbuf */
    char pb_bbop;               /* BB directory operation */
    char pb_bbstat;             /* status of BB directory operation */
    uchar pb_whl_stop;          /* wheel_idx value when this pbuf is */
                                /* to get off of the wheel */
#ifdef DEBUG
    ushort pb_hw_reloc;         /* Debug - it was a HW reloc request */
    char pad;                   /* pad to full long word */
#else
    char pad[3];                /* pad to full long word */
#endif

    struct part *pb_part;       /* ptr to part structure. Care must */
                                /* be taken when this is used since */
                                /* the parts structure can be moved */
                                /* by hd_config routines while the */
                                /* request is in flight */
    struct unique_id *pb_vgid;  /* volume group ID */
    /* used to dump the allocated pbuf at dump time */
    struct pbuf *pb_forw;       /* forward pointer */
    struct pbuf *pb_back;       /* backward pointer */
};

#define pb_addr pb.b_un.b_addr  /* too ugly in its raw form */

/* defines for pb_swretry */
#define MAX_SWRETRY     3   /* maximum retries for relocation
                               before declaring disk dead */

/* values for b_work in pbuf struct (since real b_work value only used
 * in lbuf)
 */
#define FIX_READ_ERROR  1   /* fix a previous EMEDIA read error */
#define FIX_ESOFT       2   /* fix a read or write ESOFT error */
#define FIX_EMEDIA      3   /* fix a write EMEDIA error */

/* defines for pb_type */
#define SA_PVMISSING    1   /* PV missing type request */
#define SA_STALEPP      2   /* stale PP type request */
#define SA_FRESHPP      3   /* fresh PP type request */
#define SA_CONFIGOP     4   /* hd_config operation type request */

/*
 * defines to tell hd_bldpbuf what kind of pbuf to build
 * These defines are not the only ones that tell hd_bldpbuf what to
 * build. Check the routine before changing/adding new defines here
 */
#define CATYPE_WRT  1   /* pbuf struct is a cache write type */

/*
 * defines for pb_bbop
 *
 * First set is used by the requests pbuf that is requesting the BB operation.
 * The second set is used in the bb_pbuf to control the action of the
 * actual reading and writing of the BB directory of the PV.
 */
#define BB_ADD      41  /* Add a new bad block entry to BB directory */
#define BB_UPDATE   42  /* Update a bad block entry to BB directory */
#define BB_DELETE   43  /* Delete a bad block entry to BB directory */
#define BB_RDDFCT   44  /* Reading a defective block */
#define BB_WTDFCT   45  /* Writing a defective block */
#define BB_SWRELO   46  /* Software relocation in progress */

#define RD_BBPRIM   70  /* Read the BB primary directory */
#define WT_UBBPRIM  71  /* Write BB prim dir with UPDATE */
#define WT_DBBPRIM  72  /* Rewrite BB prim dir 1st blk with UPDATE */
#define WT_UBBBACK  73  /* Write BB backup dir with UPDATE */
#define WT_DBBBACK  74  /* Rewrite BB back dir 1st blk with UPDATE */

/* defines for pb_berror: 0-63 (good) 64-127 (bad) */
#define BB_SUCCESS  0   /* BB dir updating worked */
#define BB_CRB      1   /* Reloc blkno was changed in this BB entry */

#define BB_ERROR    64  /* Bad Block directories were not updated */
#define BB_FULL     65  /* BB dir is full - no free bad blk entries */



/*
 * Volume group structure.
 *
 * Volume groups are implicitly open when any of their logical volumes are.
 */

#define MAXVGS      255 /* implementation limit on # VGs */
#define MAXLVS      256 /* implementation limit on # LVs */
#define MAXPVS      32  /* implementation limit on number */
                        /* physical volumes per vg */
#define CAHHSIZE    8   /* Number of mwc cache queues */

#define NBPI    (NBPB*sizeof(int))  /* Number of bits per int */
#define NBPL    (NBPB*sizeof(long)) /* Number of bits per long */



/* macros to set and clear the bits in the opn_pin array */

#define SETLVOPN(Vg,N) ((Vg)->opn_pin[(N)/NBPI] |= 1 << ((N)%NBPI))
#define CLRLVOPN(Vg,N) ((Vg)->opn_pin[(N)/NBPI] &= ~(1 << ((N)%NBPI)))
#define TSTLVOPN(Vg,N) ((Vg)->opn_pin[(N)/NBPI] & 1 << ((N)%NBPI))
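
The opn_pin macros above implement a conventional bit array with one bit per logical volume. The sketch below is a self-contained variant that can be compiled outside the kernel. The TOY_ names and toy_vg structure are hypothetical, CHAR_BIT stands in for NBPB, and two deliberate changes are made relative to the listing: the array is unsigned and the shifted constant is 1u, which avoids shifting into the sign bit, and the test macro returns 0 or 1 rather than the raw masked word.

```c
#include <assert.h>
#include <limits.h>

#define TOY_MAXLVS 256
#define TOY_NBPI   (CHAR_BIT * sizeof(int))     /* bits per int */

/* Hypothetical stand-in for the volgrp field: one bit per logical volume. */
struct toy_vg {
    unsigned int opn_pin[(TOY_MAXLVS + (TOY_NBPI - 1)) / TOY_NBPI];
};

/* Word index is N / bits-per-int, bit index is N % bits-per-int. */
#define TOY_SETLVOPN(Vg, N) \
    ((Vg)->opn_pin[(N) / TOY_NBPI] |= 1u << ((N) % TOY_NBPI))
#define TOY_CLRLVOPN(Vg, N) \
    ((Vg)->opn_pin[(N) / TOY_NBPI] &= ~(1u << ((N) % TOY_NBPI)))
#define TOY_TSTLVOPN(Vg, N) \
    (((Vg)->opn_pin[(N) / TOY_NBPI] >> ((N) % TOY_NBPI)) & 1u)

/* Small usage check: set, test and clear the bit for minor number 37. */
int toy_lvopn_selfcheck(void)
{
    struct toy_vg vg = { {0} };

    TOY_SETLVOPN(&vg, 37);
    assert(TOY_TSTLVOPN(&vg, 37) == 1);
    assert(TOY_TSTLVOPN(&vg, 36) == 0);
    TOY_CLRLVOPN(&vg, 37);
    assert(TOY_TSTLVOPN(&vg, 37) == 0);
    return 0;
}
```

The array length rounds up, (MAXLVS + NBPI - 1) / NBPI, so that a limit that is not a multiple of the word size still gets a final partial word.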

/*
 * macros to set and clear the bits in the ca_pv_wrt field
 * NOTE TSTALLPVWRT will not work if max PVs per VG is greater than 32
 */
#define SETPVWRT(Vg,N) ((Vg)->ca_pv_wrt[(N)/NBPI] |= 1 << ((N)%MAXPVS))
#define CLRPVWRT(Vg,N) ((Vg)->ca_pv_wrt[(N)/NBPI] &= ~(1 << ((N)%MAXPVS)))
#define TSTPVWRT(Vg,N) ((Vg)->ca_pv_wrt[(N)/NBPI] & (1 << ((N)%MAXPVS)))
#define TSTALLPVWRT(Vg,N) ((Vg)->ca_pv_wrt[(MAXPVS - 1)/NBPI])

/*
 * head of list of varied on volgrp structs in the system
 */
struct
{
    lock_t lock;            /* lock while manipulating list of VG structs */
    struct volgrp *ptr;     /* ptr to list of varied on VG structs */
} hd_vghead = {EVENT_NULL, NULL};



struct volgrp {
    lock_t vg_lock;                 /* lock for all vg structures */
    short pad1;                     /* pad to long word boundary */
    short partshift;                /* log base 2 of part size in blks */
    short open_count;               /* count of open logical volumes */
    ushort flags;                   /* VG flags field */
    ulong tot_io_cnt;               /* number of logical request to VG */
    struct lvol *lvols[MAXLVS];     /* logical volume struct array */
    struct pvol *pvols[MAXPVS];     /* physical volume struct array */
    long major_num;                 /* major number of volume group */
    struct unique_id vg_id;         /* volume group id */
    struct volgrp *nextvg;          /* pointer to next volgrp structure */
                                    /* Array of bits indicating open LVs */
                                    /* A bit per LV */
    int opn_pin[(MAXLVS + (NBPI - 1))/NBPI];
    pid_t von_pid;                  /* process ID of the varyon process */

    /* Following used in write consistency cache management */
    struct volgrp *nxtactvg;        /* pointer to next volgrp with */
                                    /* write consistency activity */
    struct pv_wait *ca_freepvw;     /* head of pv_wait free list */
    struct pv_wait *ca_pvwmem;      /* ptr to memory malloced for pvw */
                                    /* free list */
    struct hd_queue ca_hld;         /* head/tail of cache hold queue */
    ulong ca_pv_wrt[(MAXPVS + (NBPL - 1)) / NBPL];
                                    /* when bit set write cache to PV */
    char ca_inflt_cnt;              /* number of PV active writing cache */
    char ca_size;                   /* number of entries in cache */
    ushort ca_pvwblked;             /* number of times the pv_wait free */
                                    /* list has been empty */
    struct mwc_rec *mwc_rec;        /* ptr to part 1 of cache - disk rec */
    struct ca_mwc_mp *ca_part2;     /* ptr to part 2 of cache - memory */
    struct ca_mwc_mp *ca_lst;       /* mru/lru cache list anchor */
    struct ca_mwc_mp *ca_hash[CAHHSIZE]; /* write consistency hash anchors */

    /* the following 2 variables are used to control a cache clean up */
    /* operation. */
    pid_t bcachwait;                /* list waiting at the beginning */
    pid_t ecachwait;                /* list waiting at the end */
    volatile int wait_cnt;          /* count of cleanup waiters */

    /* the following are used to control the VGSAs and the wheel */
    uchar quorum_cnt;               /* Number indicating quorum of SAs */
    uchar wheel_idx;                /* VGSA wheel index into pvols */
    ushort whl_seq_num;             /* VGSA memory image sequence number */
    struct pbuf *sa_act_lst;        /* head of list of pbufs that are */
                                    /* actively on the VGSA wheel */
    struct pbuf *sa_hld_lst;        /* head of list of pbufs that are */
                                    /* waiting to get on the VGSA wheel */
    struct vgsa_area *vgsa_ptr;     /* ptr to in memory copy of VGSA */
    pid_t config_wait;              /* PID of process waiting in the */
                                    /* hd_config routines to modify the */
                                    /* memory version of the VGSA */
    struct buf sa_lbuf;             /* logical buf struct to use to wrt */
                                    /* the VGSAs */
    struct pbuf sa_pbuf;            /* physical buf struct to use to wrt */
                                    /* the VGSAs */
};

/*
 * Defines for flags field in volgrp structure
 */
#define VG_SYSMGMT   0x0002 /* VG is on for system management */
                            /* only commands */
#define VG_FORCEDOFF 0x0004 /* Should only be on when the VG was */
                            /* forced varied off and there were LVs */
                            /* still open. Under this condition the */
                            /* driver entry points can not be deleted */
                            /* from the device switch table. Therefore */
                            /* the volgrp structure must be kept */
                            /* around to handle any rogue operations */
                            /* on this VG. */
#define VG_OPENING   0x0008 /* VG is being varied on */
#define CA_INFLT     0x0010 /* The cache is being written or */
                            /* locked */
#define CA_VGACT     0x0020 /* This volgrp on mwc active list */
#define CA_HOLD      0x0040 /* Hold the cache in flight */
#define CA_FULL      0x0080 /* Cache is full - no free entries */
#define SA_WHL_ACT   0x0100 /* VGSA wheel is active */
#define SA_WHL_HLD   0x0200 /* VGSA wheel is on hold */
#define SA_WHL_WAIT  0x0400 /* config function is waiting for */
                            /* the wheel to stop */

/*
 * Logical volume structure.
 */
struct lvol {
    struct buf **work_Q;    /* work in progress hash table */
    short lv_status;        /* lv status: closed, closing, open */
    short lv_options;       /* logical dev options (see below) */
    short nparts;           /* num of part structures for this */
                            /* lv - base 1 */
    char i_sched;           /* initial scheduler policy state */
    char pad;               /* padding so data word aligned */
    ulong nblocks;          /* LV length in blocks */
    struct part *parts[3];  /* partition arrays for each mirror */
    ulong tot_wrts;         /* total number of writes to LV */
    ulong tot_rds;          /* total number of reads to LV */

    /* These fields of the lvol structure are read and/or written by
     * the bottom half of the LVDD; and therefore must be carefully
     * modified.
     */
    int cmplcnt;            /* completion count - used to quiesce */
    int waitlist;           /* event list for quiesce of LV */
};

/* lv status: */
#define LV_CLOSED       0   /* logical volume is closed */
#define LV_CLOSING      1   /* trying to close the LV */
#define LV_OPEN         2   /* logical volume is open */

/* scheduling policies: */
#define SCH_REGULAR     0   /* regular, non-mirrored LV */
#define SCH_SEQUENTIAL  1   /* sequential write, seq read */
#define SCH_PARALLEL    2   /* parallel write, read closest */
#define SCH_SEQWRTPARRD 3   /* sequential write, read closest */
#define SCH_PARWRTSEQRD 4   /* parallel write, seq read */

/* logical device options: */
#define LV_NOBBREL      0x0010  /* no bad block relocation */
#define LV_RDONLY       0x0020  /* read-only logical volume */
#define LV_DMPINPRG     0x0040  /* Dump in progress to this LV */
#define LV_DMPDEV       0x0080  /* This LV is a DUMP device */
                                /* i.e. DUMPINIT has been done */
#define LV_NOMWC        0x0100  /* no mirror write consistency */
                                /* checking */
#define LV_WRITEV       WRITEV  /* Write verify writes in LV */

/* work_Q hash algorithm - just a stub now */
#define HD_HASH(Lb) \
    (BLK2TRK((Lb)->b_blkno) & (WORKQ_SIZE-1))



/*
 * Partition structure.
 */
struct part {
    struct pvol *pvol;      /* containing physical volume */
    daddr_t start;          /* starting physical disk address */
    short sync_trk;         /* current LTG being resynced */
    char ppstate;           /* physical partition state */
    char sync_msk;          /* current LTG sync mask */
};

/*
 * Physical partition state defines PP_ and structure defines.
 *
 * The PP_STALE and PP_REDUCING bits could be combined into one but it
 * is easier to understand if they are not and a problem arises later.
 *
 * The PP_RIP bit is only valid in the primary part structure.
 */
#define PP_STALE     0x01  /* Set when PP is stale */
#define PP_CHGING    0x02  /* Set when PP is stale but the */
                           /* VGSAs have not been completely */
                           /* updated yet */
#define PP_REDUCING  0x04  /* Set when PP is in the process */
                           /* of being removed (reduced out) */
#define PP_RIP       0x08  /* Set when a Resync is in progress */
                           /* When set 'sync_trk' indicates */
                           /* the track being synced. If */
                           /* sync_trk not == -1 and PP_RIP */
                           /* not set sync_trk is next trk */
                           /* to be synced */
#define PP_SYNCERR   0x10  /* Set when error in a partition */
                           /* being resynced. Causes the */
                           /* partition to remain stale. */

#define NO_SYNCTRK   -1    /* The LP does not have a resync */
                           /* in progress */
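
The interplay between the PP_RIP bit and the sync_trk field described in the comments above can be illustrated with a small sketch. The enum and the function name pp_resync_phase are hypothetical and do not appear in the listing; only the PP_ values and NO_SYNCTRK are taken from the defines.

```c
#include <assert.h>

/* Values copied from the defines in the listing. */
#define PP_STALE    0x01
#define PP_RIP      0x08
#define NO_SYNCTRK  (-1)

/* Hypothetical classification, for illustration only. */
enum resync_phase {
    RESYNC_NONE,    /* the LP has no resync in progress */
    RESYNC_ACTIVE,  /* sync_trk is the track being synced now */
    RESYNC_NEXT     /* sync_trk is the next track to be synced */
};

static enum resync_phase pp_resync_phase(char ppstate, short sync_trk)
{
    if (ppstate & PP_RIP)
        return RESYNC_ACTIVE;       /* PP_RIP set: sync_trk is in flight */
    if (sync_trk != NO_SYNCTRK)
        return RESYNC_NEXT;         /* PP_RIP clear, but a track is queued */
    return RESYNC_NONE;
}
```

Under these assumptions a partition with PP_RIP clear and sync_trk equal to 5 would be reported as waiting to sync track 5 next.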

/*
 * Physical volume structure.
 *
 * Contains defects directory hash anchor table. The defects
 * directory is hashed by track group within partition. Entries within
 * each congruence class are sorted in ascending block addresses.
 *
 * This scheme doesn't quite work, yet. The congruence classes need
 * to be aligned with logical track groups or partitions to guarantee
 * that all blocks of this request are checked. But physical addresses
 * need not be aligned on track group boundaries.
 */

#define HASHSIZE 64   /* number of defect hash classes */

struct defect_tbl {
    struct bad_blk *defects[HASHSIZE];   /* defect directory anchor */
};

struct pvol {
    dev_t dev;               /* dev_t of physical device */
    daddr_t armpos;          /* last requested arm position */
    short xfcnt;             /* transfer count for this pv */
    short pvstate;           /* PV state */
    short pvnum;             /* LVM PV number 0-31 */
    short vg_num;            /* VG major number */
    struct file *fp;         /* file pointer from open of PV */
    char flags;              /* place to hold flags */
    char pad;                /* unused */
    short num_bbdir_ent;     /* current number of BB Dir entries */
    daddr_t fst_usr_blk;     /* first available block on the PV */
                             /* for user data */
    daddr_t beg_relblk;      /* first blkno in reloc pool */
    daddr_t next_relblk;     /* blkno of next unused relocation */
                             /* block in reloc blk pool at end */
                             /* of PV */
    daddr_t max_relblk;      /* largest blkno avail for reloc */
    struct defect_tbl *defect_tbl;  /* pointer to defect table */
    struct hd_capvq ca_pv;   /* head/tail of queue of request */
                             /* waiting for cache write to */
                             /* complete */
    struct sa_pv_whl {       /* VGSA information for this PV */
        daddr_t lsn;         /* SA logical sector number - LV 0 */
        ushort sa_seq_num;   /* SA wheel sequence number */
        char nukesa;         /* flag set if SA to be deleted */
        char pad;            /* pad to full long word */
    } sa_area[2];            /* one for each possible SA on PV */

    struct pbuf pv_pbuf;     /* pbuf struct for writing cache */
};

/* defines for pvstate field */

#define PV_MISSING  1   /* PV cannot be accessed */
#define PV_RORELOC  2   /* No HW or SW relocation allowed */
                        /* only known bad blks relocated */

/*
 * returns index into the bad block hash table for this block number
 */
#define BBHASH_IND(blkno) (BLK2TRK(blkno) & (HASHSIZE - 1))

/*
 * Macro to return defect directory congruence class pointer
 */
#define HASH_BAD(Pb,Bad_blkno) \
    ((Pb)->pb_pvol->defect_tbl->defects[BLK2TRK(Bad_blkno)&(HASHSIZE-1)])

/*
 * Used by the LVM dump device routines same as HASH_BAD but the first
 * argument is a pvol struct pointer
 */
#define HASH_BAD_DMP(Pvol,Blkno) \
    ((Pvol)->defect_tbl->defects[BLK2TRK(Blkno)&(HASHSIZE-1)])
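
A lookup in the defect directory can be sketched as follows. This is a minimal illustration, assuming 256 blocks per logical track group (32 pages of 8 blocks) and a reduced bad_blk structure that keeps only the link and the block number; the function names bbhash_ind and find_bad are hypothetical. Because each congruence class is kept in ascending block order, the walk can stop as soon as a larger block number is seen.

```c
#include <assert.h>
#include <stddef.h>

#define HASHSIZE   64
#define TRK_SHIFT  8   /* assumed: log 2 of blocks per logical track group */

/* Reduced stand-in for the bad_blk entry of the listing. */
struct bad_blk {
    struct bad_blk *next;   /* next entry in congruence class */
    long blkno;             /* bad physical disk address */
};

struct defect_tbl {
    struct bad_blk *defects[HASHSIZE];   /* defect directory anchors */
};

/* Same index computation as BBHASH_IND: track group number mod HASHSIZE. */
static unsigned bbhash_ind(long blkno)
{
    return (unsigned)(((unsigned long)blkno >> TRK_SHIFT) & (HASHSIZE - 1));
}

/* Hypothetical lookup: returns the entry for blkno, or NULL if none. */
static struct bad_blk *find_bad(struct defect_tbl *t, long blkno)
{
    struct bad_blk *b;

    for (b = t->defects[bbhash_ind(blkno)]; b != NULL; b = b->next) {
        if (b->blkno == blkno)
            return b;
        if (b->blkno > blkno)
            break;          /* list is sorted ascending: not present */
    }
    return NULL;
}
```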

/*
 * Bad block directory entry.
 */
struct bad_blk {              /* bad block directory entry */
    struct bad_blk *next;     /* next entry in congruence class */
    dev_t dev;                /* containing physical device */
    daddr_t blkno;            /* bad physical disk address */
    unsigned status: 4;       /* relocation status (see below) */
    unsigned relblk:28;       /* relocated physical disk address */
};

/* bad block relocation status values: */
#define REL_DONE     0   /* software relocation completed */
#define REL_PENDING  1   /* software relocation in progress */
#define REL_DEVICE   2   /* device (HW) relocation requested */
#define REL_CHAINED  3   /* relocation blk structure exists */
#define REL_DESIRED  8   /* relocation desired - hi order bit on */

/*
 * Macros for getting and releasing bad block structures from the
 * pool of bad_blk structures. They are linked together by their next pointers.
 * "hd_freebad" points to the head of bad_blk free list
 *
 * NOTE: Code must check if hd_freebad != null before calling
 *       the GET_BBLK macro.
 */
#define GET_BBLK(Bad) { \
    (Bad) = hd_freebad; \
    hd_freebad = hd_freebad->next; \
    hd_freebad_cnt--; \
}

#define REL_BBLK(Bad) { \
    (Bad)->next = hd_freebad; \
    hd_freebad = (Bad); \
    hd_freebad_cnt++; \
}
/*
 * Macros for accessing these data structures.
 */
#define VG_DEV2LV(Vg, Dev)      ((Vg)->lvols[minor(Dev)])
#define VG_DEV2PV(Vg, Pnum)     ((Vg)->pvols[(Pnum)])

#define BLK2PART(Pshift,Lbn)    ((ulong)(Lbn) >> (Pshift))
#define PART2BLK(Pshift,P_no)   ((P_no) << (Pshift))
#define PARTITION(Lv,P_no,Mir)  ((Lv)->parts[(Mir)] + (P_no))

/*
 * Mirror bit definitions
 */
#define PRIMARY_MIRROR    001   /* primary mirror mask */
#define SECONDARY_MIRROR  002   /* secondary mirror mask */
#define TERTIARY_MIRROR   004   /* tertiary mirror mask */
#define ALL_MIRRORS       007   /* mask of all mirror bits */

/* macro to extract mirror avoidance mask from ext parameter */
#define X_AVOID(Ext) ( ((Ext) >> AVOID_SHFT) & ALL_MIRRORS )

/*
 * Macros to select mirrors using avoidance masks:
 *
 * FIRST_MIRROR  returns first unmasked mirror (0 to 2); 3 if all masked
 * FIRST_MASK    returns first masked mirror (0 to 2); 3 if none masked
 * MIRROR_COUNT  returns number of unmasked mirrors (0 to 3)
 * MIRROR_MASK   returns a mask to avoid a specific mirror (1, 2, 4)
 * MIRROR_EXIST  returns a mask for non-existent mirrors (0, 4, 6, or 7)
 */
#define FIRST_MIRROR(Mask)      ((0x30102010 >> ((Mask) << 2)) & 0x0f)
#define FIRST_MASK(Mask)        ((0x01020103 >> ((Mask) << 2)) & 0x0f)
#define MIRROR_COUNT(Mask)      ((0x01121223 >> ((Mask) << 2)) & 0x0f)
#define MIRROR_EXIST(Nmirrors)  ((0x00000467 >> ((Nmirrors) << 2)) & 0x0f)
#define MIRROR_MASK(Mirror)     (1 << (Mirror))
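
Each of the mirror selection macros is a lookup table packed into one 32-bit constant: the 3-bit mask selects one of eight 4-bit nibbles. The sketch below repeats the macros so that it compiles on its own and adds a hypothetical helper, pick_mirror, that combines the mask of non-existent mirrors with a caller-supplied avoidance mask. The helper is an illustration only and is not part of the listing.

```c
#include <assert.h>

#define ALL_MIRRORS 007
#define FIRST_MIRROR(Mask)      ((0x30102010 >> ((Mask) << 2)) & 0x0f)
#define FIRST_MASK(Mask)        ((0x01020103 >> ((Mask) << 2)) & 0x0f)
#define MIRROR_COUNT(Mask)      ((0x01121223 >> ((Mask) << 2)) & 0x0f)
#define MIRROR_EXIST(Nmirrors)  ((0x00000467 >> ((Nmirrors) << 2)) & 0x0f)
#define MIRROR_MASK(Mirror)     (1 << (Mirror))

/*
 * Hypothetical helper: choose the first mirror that both exists and is
 * not in the avoidance mask. A result of 3 means no mirror is usable.
 */
static int pick_mirror(int nmirrors, int avoid)
{
    int mask = (MIRROR_EXIST(nmirrors) | avoid) & ALL_MIRRORS;

    return FIRST_MIRROR(mask);
}
```

For example, with two copies and mirror 0 avoided, the combined mask is 5 (binary 101), nibble 5 of 0x30102010 is 1, and so the secondary mirror is selected.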



/*
 * DBSIZE and DBSHIFT were originally UBSIZE and UBSHIFT from param.h.
 * They were renamed and moved to here to more closely resemble a disk
 * block and not a user block size.
 */
#define DBSIZE   512   /* Disk block size in bytes */
#define DBSHIFT  9     /* log 2 of DBSIZE */



/*
 * LVPAGESIZE and LVPGSHIFT were originally PAGESIZE and PGSHIFT from param.h.
 * They were renamed and moved to here to isolate LVM from the changeable
 * system parameters that would have undesirable effects on LVM functionality.
 */
#define LVPAGESIZE  4096   /* Page size in bytes */
#define LVPGSHIFT   12     /* log 2 of LVPAGESIZE */



#define BPPG        (LVPAGESIZE/DBSIZE)   /* blocks per page */
#define BPPGSHIFT   (LVPGSHIFT-DBSHIFT)   /* log 2 of BPPG */
#define PGPTRK      32                    /* pages per logical track group */
#define TRKSHIFT    5                     /* log base 2 of PGPTRK */
#define LTGSHIFT                          /* [value illegible] */
#define BYTEPTRK                          /* [value illegible] */
#define BLKPTRK     PGPTRK*BPPG           /* blocks per logical track group */

#define SIGNED_SHFMSK  0x80000000         /* signed mask for shifting to */
                                          /* get page affected mask */




#define BLK2BYTE(Nblocks)  ((unsigned)(Nblocks) << (DBSHIFT))
#define BYTE2BLK(Nbytes)   ((unsigned)(Nbytes) >> (DBSHIFT))
#define BLK2PG(Blk)        ((unsigned)(Blk) >> BPPGSHIFT)
#define PG2BLK(Pageno)     ((Pageno) << (LVPGSHIFT-DBSHIFT))
#define BLK2TRK(Blk)       ((unsigned)(Blk) >> (TRKSHIFT + BPPGSHIFT))
#define TRK2BLK(T_no)      ((unsigned)(T_no) << (TRKSHIFT + BPPGSHIFT))
#define PG2TRK(Pageno)     ((unsigned)(Pageno) >> TRKSHIFT)

/* LTG per partition */
#define TRKPPART(Pshift)   ((unsigned)(1 << (Pshift - LTGSHIFT)))

/* LTG in the partition */
#define TRK_IN_PART(Pshift,Blk)  (BLK2TRK(Blk) & (TRKPPART(Pshift) - 1))
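
The unit conversions reduce to shifts by fixed amounts: 512-byte blocks, 8 blocks per 4096-byte page, and 32 pages (256 blocks) per logical track group. The sketch below repeats the relevant macros with those values and adds a hypothetical helper, blk_in_trk, that is not part of the listing, to show the arithmetic on a concrete block number.

```c
#include <assert.h>

#define DBSHIFT    9                        /* log 2 of 512-byte block */
#define LVPGSHIFT  12                       /* log 2 of 4096-byte page */
#define BPPGSHIFT  (LVPGSHIFT - DBSHIFT)    /* log 2 of blocks per page */
#define TRKSHIFT   5                        /* log 2 of pages per LTG */

#define BLK2BYTE(Nblocks)  ((unsigned)(Nblocks) << DBSHIFT)
#define BLK2PG(Blk)        ((unsigned)(Blk) >> BPPGSHIFT)
#define BLK2TRK(Blk)       ((unsigned)(Blk) >> (TRKSHIFT + BPPGSHIFT))
#define TRK2BLK(T_no)      ((unsigned)(T_no) << (TRKSHIFT + BPPGSHIFT))

/* Hypothetical helper: offset of a block within its logical track group. */
static unsigned blk_in_trk(unsigned blk)
{
    return blk - TRK2BLK(BLK2TRK(blk));
}
```

Block 600, for instance, lies in page 75 and in logical track group 2, which starts at block 512, so its offset within the track group is 88.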



/* defines for top half of LVDD */
#define LVDD_HFREE_BB   30   /* high water mark for kernel bad_blk struct */
#define LVDD_LFREE_BB   15   /* low water mark for kernel bad_blk struct */
#define WORKQ_SIZE      64   /* size of LVs work in progress queue */
#define PBSUBPOOLSIZE   16   /* size of pbuf subpool alloc'd by PVs */
#define HD_ALIGN             /* [value illegible] align characteristics for alloc'd memory */
#define FULL_WORDMASK   3    /* mask for full word (log base 2) */
#define BUFCNT          3    /* parameter sent to uphysio for # buf */
                             /* structs to allocate */



#define NOMIRROR     0   /* no mirrors */
#define PRIMMIRROR   0   /* primary mirror */
#define SINGMIRROR   1   /* one mirror */
#define DOUBMIRROR   2   /* two mirrors */

#define MAXNUMPARTS  3   /* maximum number of parts in a logical part */
#define PVNUMVGDAS   2   /* max number of VGDA/VGSAs on a PV */



/* return codes for LVDD top 1/2 */
#define LVDD_SUCCESS   0      /* general success code */
#define LVDD_ERROR     -1     /* general error code */
#define LVDD_NOALLOC   -200   /* hd_init: not able to allocate pool of bufs */

#endif /* _H_DASD */



/* HD.H */

#ifndef _H_HD
#define _H_HD
/*
 * COMPONENT_NAME: (SYSXLVM) Logical Volume Manager Device Driver - hd.h
 *
 * © COPYRIGHT International Business Machines Corp. 1988, 1990
 * All Rights Reserved
 */

#include <sys/errids.h>

/*
 * LVDD internal macros and extern statically declared variables.
 */



/* LVM internal defines: */
#define FAILURE     0    /* must be logic FALSE for 'if' tests */
#define SUCCESS     1    /* must be logic TRUE for 'if' tests */
#define MAXGRABLV   16   /* Max number of LVs to grab pbuf structs */
#define MAXSYSVG    3    /* Max number of VGs to grab pbuf structs */
#define CAHEAD      1    /* move cache entry to head of use list */
#define CATAIL      2    /* move cache entry to tail of use list */
#define CA_MISS     0    /* MWC cache miss */
#define CA_HIT      1    /* MWC cache hit */
#define CA_LBHOLD   2    /* The logical request should hold */



/*
 * Following defines are used to communicate with the kernel process
 */
#define LVDD_KP_TERM    0x80000000   /* Terminate the kernel process */
#define LVDD_KP_BADBLK  0x40000000   /* Need more bad_blk structs */
#define LVDD_KP_ACTMSK  0xC0000000   /* Mask of all events */

/*
 * Following defines are used in the b_options of the logical buf struct.
 * They should be reserved in lvdd.h in relationship to the ext parameters
 */
#define REQ_IN_CACH     0x40000000   /* When set in the lbuf b_options */
                                     /* the request is in the mirror */
                                     /* write consistency cache */
#define REQ_VGSA        0x20000000   /* When set in the lbuf b_options */
                                     /* it means this is a VGSA write */
                                     /* and to use the special sa_pbuf */
                                     /* in the volgrp structure */
/*
 * The following variables are only used in the kernel and therefore are
 * only included if the _KERNEL variable is defined.
 */
#ifdef _KERNEL
#include <sys/syspest.h>

/*
 * Set up a debug level if debug turned on
 */
#ifdef DEBUG
#ifdef LVDD_PHYS
BUGVDEF(debuglvl,0)
#else
BUGXDEF(debuglvl)
#endif
#endif

/*
 * pending queue
 *
 * This is the primary data structure for passing work from
 * the strategy routines (see hd_strat.c) to the scheduler
 * (see hd_sched.c) via the mirror write consistency logic.
 * From this queue the request will go to one of three other
 * queues.
 *
 *   1. cache hold queue - If the request involves mirrors
 *      and the write consistency cache is in flight.
 *      i.e. being written to PVs.
 *
 *   2. cache PV queue - If the request must wait for the
 *      write consistency cache to be written to the PV.
 *
 *   3. schedule queue - Requests are scheduled from this
 *      queue.
 *
 * This queue is only changed within a device driver critical section.
 */
#ifdef LVDD_PHYS
struct hd_queue pending_Q;
#else
extern struct hd_queue pending_Q;
#endif

/*
 * ready queue -- physical requests that are ready to start.
 *
 * This queue is only valid within a single critical section.
 * It really contains a list of pbufs, but only the imbedded
 * buf struct is of interest at this point. Since the pointers
 * are of type (struct buf *) it is convenient that the queue be
 * declared similarly.
 */
#ifdef LVDD_PHYS
struct buf *ready_Q = NULL;
#else
extern struct buf *ready_Q;
#endif



/*
 * Chain of free and available pbuf structs.
 */
#ifdef LVDD_PHYS
struct pbuf *hd_freebuf = NULL;




#else 

extern struct pbuf *hd_freebuf; 
#endif 

/*
 * Chain of pbuf structs currently allocated and pinned for LVDD use.
 * Only used at dump time and by crash to find them.
 */
#ifdef LVDD_PHYS
struct pbuf *hd_dmpbuf = NULL;
#else
extern struct pbuf *hd_dmpbuf;
#endif



/*
 * Chain and count of free and available bad_blk structs.
 * The first open of a VG, really the first open of an LV, will cause
 * LVDD_HFREE_BB (currently 30) bad_blk structs to be allocated and
 * chained here. After that when the count gets to LVDD_LFREE_BB (low
 * water mark, currently 15) the kernel process will be kicked to go
 * get more up to LVDD_HFREE_BB (high water mark) more.
 *
 * *NOTE* hd_freebad_lk is a lock mechanism to keep the top half of the
 *        driver and the kernel process from colliding. This would only
 *        happen if the last request before the last LV closed received
 *        an ESOFT or EMEDIA (and request was a write) and the getting of
 *        a bad_blk struct caused the count to go below the low water
 *        mark. This would result in the kproc trying to put more
 *        structures on the list while hd_close via hd_frefrebb would
 *        be removing them.
 */
#ifdef LVDD_PHYS
int hd_freebad_lk = LOCK_AVAIL;
struct bad_blk *hd_freebad = NULL;
int hd_freebad_cnt = 0;
#else
extern int hd_freebad_lk;
extern struct bad_blk *hd_freebad;
extern int hd_freebad_cnt;
#endif

/*
 * Chain of volgrp structs that have write consistency caches that need
 * to be written to PVs. This chain is used so all incoming requests
 * can be scanned before putting the write consistency cache in flight.
 * Once in flight the cache is locked out and any new requests will have
 * to wait for all cache writes to finish.
 */
#ifdef LVDD_PHYS
struct volgrp *hd_vg_mwc = NULL;
#else
extern struct volgrp *hd_vg_mwc;
#endif

/*
 * The following arrays are used to allocate mirror write consistency
 * caches in a group of 8 per page. This is due to the way the hide
 * mechanism works only on page quantities. These two arrays should be
 * treated as being in lock step. The lock, hd_ca_lock, is used to
 * ensure only one process is playing with the arrays at any one time.
 */
#define VGS_CA ((MAXVGS + (NBPB-1))/NBPB)
#ifdef LVDD_PHYS
lock_t hd_ca_lock = LOCK_AVAIL;       /* lock for cache arrays */
char ca_alloced[VGS_CA];              /* bit per VG with cache allocated */
struct mwc_rec *ca_grp_ptr[VGS_CA];   /* 1 for each 8 VGs */
#else
extern lock_t hd_ca_lock;
extern char ca_alloced[];
extern struct mwc_rec *ca_grp_ptr[];
#endif

/*
 * The following variables are used to control the number of pbuf
 * structures allocated for LVM use. It is based on the number of
 * PVs in varied on VGs. The first PV gets 64 structures and each
 * PV thereafter gets 16 more. The number is reduced only when a
 * VG goes inactive. i.e. all its LVs are closed.
 */
#ifdef LVDD_PHYS
int hd_pbuf_cnt = 0;                   /* Total Number of pbufs allocated */
int hd_pbuf_grab = PBSUBPOOLSIZE;      /* Number of pbuf structs to allocate */
                                       /* for each active PV on the system */
int hd_pbuf_min = PBSUBPOOLSIZE * 4;   /* Number of pbuf to allocate for the */
                                       /* first PV on the system */
int hd_vgs_opn = 0;                    /* Number of VGs opened */
int hd_lvs_opn = 0;                    /* Number of LVs opened */
int hd_pvs_opn = 0;                    /* Number of PVs in varied on VGs */
int hd_pbuf_inuse = 0;                 /* Number of pbufs currently in use */
int hd_pbuf_maxuse = 0;                /* Maximum number of pbufs in use during */
                                       /* this boot */
#else
extern int hd_pbuf_cnt;
extern int hd_pbuf_grab;
extern int hd_pbuf_min;
extern int hd_vgs_opn;
extern int hd_lvs_opn;
extern int hd_pvs_opn;
extern int hd_pbuf_inuse;
extern int hd_pbuf_maxuse;
#endif
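
The sizing rule stated in the comment (64 pbuf structures for the first PV, 16 for each PV thereafter) can be written out as a short calculation. The function name pbufs_for_pvs is hypothetical; the listing itself only exposes the two tunables hd_pbuf_min and hd_pbuf_grab, whose initial values are used here as constants.

```c
#include <assert.h>

#define PBSUBPOOLSIZE 16

/*
 * Hypothetical helper: total pbuf structures expected for npvs physical
 * volumes in varied on volume groups, using the initial values of
 * hd_pbuf_min (PBSUBPOOLSIZE * 4) and hd_pbuf_grab (PBSUBPOOLSIZE).
 */
static int pbufs_for_pvs(int npvs)
{
    if (npvs <= 0)
        return 0;
    return PBSUBPOOLSIZE * 4 + (npvs - 1) * PBSUBPOOLSIZE;
}
```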



/*
 * The following are used to update the bad block directory on a disk
 */
#ifdef LVDD_PHYS
struct pbuf *bb_pbuf;     /* ptr to pbuf reserved for BB dir updating */
struct hd_queue bb_hld;   /* holding Q used when there is a BB */
                          /* directory update in progress */
#else
extern struct pbuf *bb_pbuf;
extern struct hd_queue bb_hld;
#endif

/*
 * The following variables are used to communicate between the LVDD
 * and the kernel process.
 */
#ifdef LVDD_PHYS
pid_t hd_kpid = 0;   /* PID of the kernel process */
#else
extern pid_t hd_kpid;
#endif

/*
 * The following variables are used in an attempt to keep some information
 * around about the performance and potential bottle necks in the driver.
 * Currently these must be looked at with crash or the kernel debugger.
 */
#ifdef LVDD_PHYS
ulong hd_pendqblked = 0;   /* How many times the scheduling queue */
                           /* (pending_Q) has been blocked due to no */
                           /* pbufs being available. */
#else
extern ulong hd_pendqblked;
#endif

/*
 * The following are used to log error messages by LVDD. The de_data
 * is defined as a general 16 byte array, BUT, its actual use is
 * totally dependent on the error type.
 */
#define RESRC_NAME "LVDD"   /* Resource name for error logging */
struct hd_errlog_ent {      /* Error log entry structure */
    struct err_rec0 id;
    char de_data[16];
};

/* macros to allocate and free pbuf structures */
#define GET_PBUF(PB) { \
    (PB) = hd_freebuf; \
    hd_freebuf = (struct pbuf *) hd_freebuf->pb.av_forw; \
    hd_pbuf_inuse++; \
    if( hd_pbuf_inuse > hd_pbuf_maxuse ) \
        hd_pbuf_maxuse = hd_pbuf_inuse; \
}

#define REL_PBUF(PB) { \
    (PB)->pb.av_forw = (struct buf *) hd_freebuf; \
    hd_freebuf = (PB); \
    hd_pbuf_inuse--; \
}

/* macros to allocate and free pv_wait structures */
#define GET_PVWAIT(Pvw, Vg) { \
    (Pvw) = (Vg)->ca_freepvw; \
    (Vg)->ca_freepvw = (Pvw)->nxt_pv_wait; \
}

#define REL_PVWAIT(Pvw, Vg) { \
    (Pvw)->nxt_pv_wait = (Vg)->ca_freepvw; \
    (Vg)->ca_freepvw = (Pvw); \
}

#define TST_PVWAIT(Vg) ((Vg)->ca_freepvw == NULL)
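
The allocation macros implement a singly linked free list that is pushed and popped at its head, with a running in-use count and a high water mark. The following stand-alone sketch shows that behaviour with a reduced pbuf structure holding only the forward link; the structure layout and the pool_seed function are simplifications introduced for the example and are not the driver's definitions. As in the listing, GET_PBUF does not test for an empty list, so the caller is assumed to have checked hd_freebuf first.

```c
#include <assert.h>
#include <stddef.h>

/* Reduced stand-in for struct pbuf: only the free-list link is kept. */
struct pbuf {
    struct pbuf *av_forw;
};

static struct pbuf *hd_freebuf = NULL;   /* head of the free list */
static int hd_pbuf_inuse = 0;            /* pbufs currently in use */
static int hd_pbuf_maxuse = 0;           /* high water mark of inuse */

/* Pop from the head; caller must ensure hd_freebuf != NULL. */
#define GET_PBUF(PB) { \
    (PB) = hd_freebuf; \
    hd_freebuf = hd_freebuf->av_forw; \
    hd_pbuf_inuse++; \
    if (hd_pbuf_inuse > hd_pbuf_maxuse) \
        hd_pbuf_maxuse = hd_pbuf_inuse; \
}

/* Push back onto the head. */
#define REL_PBUF(PB) { \
    (PB)->av_forw = hd_freebuf; \
    hd_freebuf = (PB); \
    hd_pbuf_inuse--; \
}

/* Hypothetical helper: add a new pbuf to the pool without counting it. */
static void pool_seed(struct pbuf *pb)
{
    pb->av_forw = hd_freebuf;
    hd_freebuf = pb;
}
```

Because both operations work on the head of the list, the most recently released pbuf is the next one handed out.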

/*
 * Macro to put volgrp ptr at head of the list of VGs waiting to start
 * MWC cache writes
 */
#define CA_VG_WRT(Vg) { \
    if( !((Vg)->flags & CA_VGACT) ) \
    { \
        (Vg)->nxtactvg = hd_vg_mwc; \
        hd_vg_mwc = (Vg); \
        (Vg)->flags |= CA_VGACT; \
    } \
}

/*
 * Macro to determine if a physical request should be returned to
 * the scheduling layer or continue (resume).
 */
#define PB_CONT(Pb) { \
    if( ((Pb)->pb_addr == ((Pb)->pb_lbuf->b_baddr + (Pb)->pb_lbuf->b_bcount)) || \
        ((Pb)->pb.b_flags & B_ERROR) ) \
        HD_SCHED((Pb)); \
    else \
        hd_resume((Pb)); \
}

/*
 * HD_SCHED -- invoke scheduler policy routine for this request
 *
 * For physical requests it invokes the physical operation end policy.
 */
#define HD_SCHED(Pb) (*(Pb)->pb_sched)(Pb)

/* define for b_error value (only used by LVDD) */
#define ELBBLOCKED 255   /* this logical request is blocked by */
                         /* another one in progress */

#endif /* _KERNEL */

/*
 * Write consistency cache structures and macros
 */






/* cache hash algorithms - returns index into cache hash table */

#define CA_HASH(Lb)    (BLK2TRK((Lb)->b_blkno) & (CAHHSIZE-1))

#define CA_THASH(Trk)  ((Trk) & (CAHHSIZE-1))

/*
 * This structure will generally be referred to as part 2 of the cache
 */
struct ca_mwc_mp {   /* cache mirror write consistency memory only part */
    struct ca_mwc_mp *hq_next;   /* ptr to next hash queue entry */
    char state;                  /* State of entry */
    char pad1;                   /* Pad to word */
    ushort iocnt;                /* Non-zero - io active to LTG */
    struct ca_mwc_dp *part1;     /* Ptr to part 1 entry - ca_mwc_dp */
    struct ca_mwc_mp *next;      /* Next memory part struct */
    struct ca_mwc_mp *prev;      /* Previous memory part struct */
};

/* ca_mwc_mp state defines */
#define CANOCHG  0x00   /* Cache entry has NOT changed since last */
                        /* cache write operation, but is on a hash */
                        /* queue somewhere */
#define CACHG    0x01   /* Cache entry has changed since last cache */
                        /* write operation */
#define CACLEAN  0x02   /* Cache entry has not been used since last */
                        /* clean up operation */

/*
 * This structure will generally be referred to as part 1 of the cache
 * In order to stay long word aligned this structure has a 2 byte pad.
 * This reduces the number of cache entries available in the cache.
 */
struct ca_mwc_dp {      /* cache mirror write consistency disk part */
    ulong lv_ltg;       /* LV logical track group */
    ushort lv_minor;    /* LV minor number */
    short pad;
};

#define MAX_CA_ENT 62   /* Max number that will fit in block */

/*
 * This structure must be maintained to be 1 block in length (512 bytes).
 * This also implies the maximum number of write consistency cache entries.
 */
struct mwc_rec {        /* mirror write consistency disk record */
    struct timestruc_t b_tmstamp;         /* Time stamp at beginning of block */
    struct ca_mwc_dp ca_p1[MAX_CA_ENT];   /* Reserve 62 part 1 structures */
    struct timestruc_t e_tmstamp;         /* Time stamp at end of block */
};
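
The requirement that the disk record occupy exactly one 512-byte block can be checked at compile time. The sketch below assumes a 32-bit layout: ulong is taken as 4 bytes and the time stamp as two 4-byte fields, giving 8 + 62 * 8 + 8 = 512 bytes. The type timestruc32 and the fixed-width field types are stand-ins chosen for the example, not the definitions used by the driver.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_CA_ENT 62

/* Assumed stand-in for the time stamp type: two 32-bit fields. */
struct timestruc32 {
    uint32_t tv_sec;
    uint32_t tv_nsec;
};

struct ca_mwc_dp {          /* disk part of a cache entry, 8 bytes */
    uint32_t lv_ltg;        /* LV logical track group */
    uint16_t lv_minor;      /* LV minor number */
    int16_t  pad;           /* keeps the entry long word aligned */
};

struct mwc_rec {            /* one 512-byte disk block */
    struct timestruc32 b_tmstamp;
    struct ca_mwc_dp   ca_p1[MAX_CA_ENT];
    struct timestruc32 e_tmstamp;
};

/* Compile-time check: the array size is negative if the size is wrong. */
typedef char mwc_rec_is_one_block[(sizeof(struct mwc_rec) == 512) ? 1 : -1];
```

Without the 2-byte pad each entry would need only 6 bytes of data, which is why the comment notes that padding reduces the number of entries that fit in the block.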



/*
 * This structure is used by the MWCM. It is hung on the PV cache write
 * queues to indicate which lbufs are waiting on any particular PV. The
 * define controls how much memory to allocate to hold these structures.
 * The algorithm is 3 * CA_MULT * cache size * size of structure.
 */
#define CA_MULT 4   /* pv_wait * cache size multiplier */

struct pv_wait {
    struct pv_wait *nxt_pv_wait;   /* next pv_wait structure on chain */
    struct buf *lb_wait;           /* ptr to lbuf waiting for cache */
};
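
The memory formula in the comment can be evaluated directly. The sketch below assumes that "cache size" means the number of cache entries (at most MAX_CA_ENT, 62); the function name pv_wait_pool_bytes is hypothetical and the result depends on the pointer size of the platform.

```c
#include <assert.h>
#include <stddef.h>

#define CA_MULT     4
#define MAX_CA_ENT  62

struct buf;                        /* opaque here; only pointers are used */

struct pv_wait {
    struct pv_wait *nxt_pv_wait;   /* next pv_wait structure on chain */
    struct buf *lb_wait;           /* ptr to lbuf waiting for cache */
};

/* 3 * CA_MULT * cache size * size of structure, as stated in the comment. */
static size_t pv_wait_pool_bytes(size_t cache_entries)
{
    return 3 * CA_MULT * cache_entries * sizeof(struct pv_wait);
}
```

For a full cache of 62 entries this reserves room for 744 pv_wait structures.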
/*
 * LVM function declarations - arranged by module in order by how they occur
 * in said module.
 */
#ifdef _KERNEL
#ifndef _NO_PROTO

/* hd_mircach.c */

extern int hd_ca_ckcach (
    register struct buf *lb,             /* current logical buf struct */
    register struct volgrp *vg,          /* ptr to volgrp structure */
    register struct lvol *lv);           /* ptr to lvol structure */

extern void hd_ca_use (
    register struct volgrp *vg,          /* ptr to volgrp structure */
    register struct ca_mwc_mp *ca_ent,   /* cache entry pointer */
    register int h_t);                   /* head/tail flag */

extern struct ca_mwc_mp *hd_ca_new (
    register struct volgrp *vg);         /* ptr to volgrp structure */

extern void hd_ca_wrt (void);

extern void hd_ca_wend (
    register struct pbuf *pb);           /* Address of pbuf completed */

extern void hd_ca_sked (
    register struct volgrp *vg,          /* ptr to volgrp structure */
    register struct pvol *pvol);         /* pvol ptr for this PV */

extern struct ca_mwc_mp *hd_ca_fnd (
    register struct volgrp *vg,          /* ptr to volgrp structure */
    register struct buf *lb);            /* ptr to lbuf to find the entry */
                                         /* for */

extern void hd_ca_clnup (
    register struct volgrp *vg);         /* ptr to volgrp structure */

extern void hd_ca_qunlk (
    register struct volgrp *vg,          /* ptr to volgrp structure */
    register struct ca_mwc_mp *ca_ent);  /* ptr to entry to unlink */

extern int hd_ca_pvque (
    register struct buf *lb,             /* current logical buf struct */
    register struct volgrp *vg,          /* ptr to volgrp structure */
    register struct lvol *lv);           /* ptr to lvol structure */

extern void hd_ca_end (
    register struct pbuf *pb);           /* physical device buf struct */

extern void hd_ca_term (
    register struct buf *lb);            /* current logical buf struct */

extern void hd_ca_mvhld (
    register struct volgrp *vg);         /* ptr to volgrp structure */

/* hd_dump.c */

extern int hd_dump (
    dev_t dev,                     /* major/minor of LV */
    struct uio *uiop,              /* ptr to uio struct describing operation */
    int cmd,                       /* dump command */
    char *arg,                     /* cmd dependent - ptr to dmp_query struct */
    int chan,                      /* not used */
    int ext);                      /* not used */

extern int hd_dmpxlate (
    register dev_t dev,            /* major/minor of LV */
    register struct uio *luiop,    /* ptr to logical uio structure */
    register struct volgrp *vg);   /* ptr to VG from device switch table */

/* hd_top.c */

extern int hd_open (
    dev_t dev,             /* device number major,minor of LV to be opened */
    int flags,             /* read/write flag */
    int chan,              /* not used */
    int ext);              /* not used */

extern int hd_allocpbuf(void);

extern void hd_pbufdmpq(
    register struct pbuf *pb,     /* new pbuf for chain */
    register struct pbuf **qq);   /* Ptr to queue anchor */

extern void hd_openbkout(
    int bopoint,           /* point to start backing out */
    struct volgrp *vg);    /* struct volgrp ptr */

extern void hd_backout(
    int bopoint,           /* point where error occurred & need to */
                           /* backout all structures pinned before */
                           /* this point */
    struct lvol *lv,       /* ptr to lvol to backout */
    struct volgrp *vg);    /* struct volgrp ptr */

extern int hd_close(
    dev_t dev,             /* device number major,minor of LV to be closed */
    int chan,              /* not used */
    int ext);              /* not used */

extern void hd_vgcleanup(
    struct volgrp *vg);    /* struct volgrp ptr */

extern void hd_frefrebb(void);

extern int hd_allocbblk(void);

extern int hd_read(
    dev_t dev,             /* num major,minor of LV to be read */
    struct uio *uiop,      /* pointer to uio structure that specifies */
                           /* location & length of caller's data buffer */
    int chan,              /* not used */
    int ext);              /* extension parameters */

extern int hd_write(
    dev_t dev,             /* num major,minor of LV to be written */
    struct uio *uiop,      /* pointer to uio structure that specifies */
                           /* location & length of caller's data buffer */
    int chan,              /* not used */
    int ext);              /* extension parameters */

extern int hd_mincnt(
    struct buf *bp,        /* ptr to buf struct to be checked */
    void *minparms);       /* ptr to ext value sent to uphysio by */
                           /* hd_read/hd_write. */

extern int hd_ioctl(
    dev_t dev,             /* device number major,minor of LV to be opened */
    int cmd,               /* specific ioctl command to be performed */
    int arg,               /* addr of parameter blk for the specific cmd */
    int mode,              /* request origination */
    int chan,              /* not used */
    int ext);              /* not used */

extern struct mwc_rec *hd_alloca(void);

extern void hd_dealloca(
    register struct mwc_rec *ca_ptr);   /* ptr to cache to free */

extern void hd_nodumpvg(
    struct volgrp *);

/* hd_phys.c */

extern void hd_begin(
    register struct pbuf *pb,      /* physical device buf struct */
    register struct volgrp *vg);   /* pointer to volgrp struct */

extern void hd_end(
    register struct pbuf *pb);     /* physical device buf struct */

extern void hd_resume(
    register struct pbuf *pb);     /* physical device buf struct */

extern void hd_ready(
    register struct pbuf *pb);     /* physical request buf */

extern void hd_start(void);

extern void hd_gettime(
    register struct timestruc_t *o_time);   /* old time */

/* hd_bbrel.c */

extern int hd_chkblk(
    register struct pbuf *pb);     /* physical device buf struct */

extern void hd_bbend(
    register struct pbuf *pb);     /* physical device buf struct */

extern void hd_baddone(
    register struct pbuf *pb);     /* physical request to process */

extern void hd_badblk(
    register struct pbuf *pb);     /* physical request to process */

extern void hd_swreloc(
    register struct pbuf *pb);     /* physical request to process */

extern daddr_t hd_assignalt(
    register struct pbuf *pb);     /* physical request to process */

extern struct bad_blk *hd_fndbbrel(
    register struct pbuf *pb);     /* physical request to process */

extern void hd_nqbblk(
    register struct pbuf *pb);     /* physical request to process */

extern void hd_dqbblk(
    register struct pbuf *pb,      /* physical request to process */
    register daddr_t blkno);

/* hd_sched.c */

extern void hd_schedule(void);

extern int hd_avoid(
    register struct buf *lb,       /* logical request buf */
    register struct volgrp *vg);   /* VG volgrp ptr */

extern void hd_resyncpp(
    register struct pbuf *pb);     /* physical device buf struct */

extern void hd_freshpp(
    register struct volgrp *vg,    /* pointer to volgrp struct */
    register struct pbuf *pb);     /* physical request buf */

extern void hd_mirread(
    register struct pbuf *pb);     /* physical device buf struct */

extern void hd_fixup(
    register struct pbuf *pb);     /* physical device buf struct */

extern void hd_stalepp(
    register struct volgrp *vg,    /* pointer to volgrp struct */
    register struct pbuf *pb);     /* physical device buf struct */

extern void hd_staleppe(
    register struct pbuf *pb);     /* physical request buf */

extern void hd_xlate(
    register struct pbuf *pb,      /* physical request buf */
    register int mirror,           /* mirror number */
    register struct volgrp *vg);   /* VG volgrp ptr */

extern int hd_regular(
    register struct buf *lb,       /* logical request buf */
    register struct volgrp *vg);   /* volume group structure */

extern void hd_finished(
    register struct pbuf *pb);     /* physical device buf struct */

extern int hd_sequential(
    register struct buf *lb,       /* logical request buf */
    register struct volgrp *vg);   /* volume group structure */

extern int hd_seqnext(
    register struct pbuf *pb,      /* physical request buf */
    register struct volgrp *vg);   /* VG volgrp pointer */

extern void hd_seqwrite(
    register struct pbuf *pb);     /* physical device buf struct */

extern int hd_parallel(
    register struct buf *lb,       /* logical request buf */
    register struct volgrp *vg);   /* volume group structure */

extern void hd_freeall(
    register struct pbuf *q);      /* write request queue */

extern void hd_append(
    register struct pbuf *pb,      /* physical request pbuf */
    register struct pbuf **qq);    /* Ptr to write request queue anchor */

extern void hd_nearby(
    register struct pbuf *pb,      /* physical request pbuf */
    register struct buf *lb,       /* logical request buf */
    register int mask,             /* mirrors to avoid */
    register struct volgrp *vg,    /* volume group structure */
    register struct lvol *lv);

extern void hd_parwrite(
    register struct pbuf *pb);     /* physical device buf struct */

/* hd_strat.c */

extern void hd_strategy(
    register struct buf *lb);      /* input list of logical buf structs */

extern void hd_initiate(
    register struct buf *lb);      /* input list of logical bufs */

extern struct buf *hd_reject(
    struct buf *lb,                /* offending buf structure */
    int errno);                    /* error number */

extern void hd_quiescevg(
    struct volgrp *vg);            /* pointer from device switch table */

extern void hd_quiet(
    dev_t dev,                     /* number major,minor of LV to quiesce */
    struct volgrp *vg);            /* ptr from device switch table */

extern void hd_redquiet(
    dev_t dev,                     /* number major,minor of LV */
    struct hd_lvred *red_lst);     /* ptr to list of PPs to remove */

extern int hd_add2pool(
    register struct pbuf *subpool, /* ptr to pbuf sub pool */
    register struct pbuf *dmpq);   /* ptr to pbuf dump queue */

extern void hd_deallocpbuf(void);

extern int hd_numpbufs(void);

extern void hd_terminate(
    register struct buf *lb);      /* logical buf struct */

extern void hd_unblock(
    register struct buf *next,     /* first request on hash chain */
    register struct buf *lb);      /* logical request to reschedule */

extern void hd_quelb (
    register struct buf *lb,       /* current logical buf struct */
    register struct hd_queue *que); /* queue structure ptr */

extern int hd_kdis_initmwc(
    struct volgrp *vg);            /* volume group pointer */

extern int hd_kdis_dswadd(
    register dev_t device,         /* device number of the VG */
    register struct devsw *devsw); /* address of the devsw entry */

extern int hd_kdis_chgqrm(
    struct volgrp *vg,             /* volume group pointer */
    short newqrm);                 /* new quorum count */

extern int hd_kproc(void);

/* hd_vgsac 7 

extern int hd_sa_strt(
    register struct pbuf *pb, /* physical device buf struct */
    register struct volgrp *vg, /* volgrp pointer */
    register int type); /* type of request */

extern void hd_sa_iodone(
    register struct buf *lb); /* ptr to lbuf in VG just completed */

extern void hd_sa_cont(
    register struct volgrp *vg, /* volgrp pointer */
    register int sa_updated); /* ptr to lbuf in VG just completed */

extern void hd_sa_hback(
    register struct pbuf *head_ptr, /* head of pbuf list */
    register struct pbuf *new_pbuf); /* ptr to pbuf to append to list */

extern void hd_sa_rtn(
    register struct pbuf *head_ptr, /* head of pbuf list */
    register int err_flg); /* if true return requests with ENXIO error */

extern void hd_sa_wrt(
    register struct volgrp *vg); /* volgrp pointer */

extern int hd_sa_whladv(
    register struct volgrp *vg, /* volgrp pointer */
    register int c_whl_idx); /* current wheel index */

extern void hd_sa_update(
    register struct volgrp *vg); /* volgrp pointer */

extern int hd_sa_qrmchk(
    register struct volgrp *vg); /* volgrp pointer */

extern int hd_sa_config(
    register struct volgrp *vg, /* volgrp pointer */
    register int type, /* type of hd_config request */
    register caddr_t arg); /* ptr to arguments for the request */

extern int hd_sa_onerev(
    register struct volgrp *vg, /* volgrp pointer */
    register struct pbuf *pv, /* ptr pbuf structure */
    register int type); /* type of hd_config request */

extern void hd_bldpbuf(
    register struct pbuf *pb, /* ptr to pbuf struct */
    register struct pvol *pvol, /* target pvol ptr */
    register int type, /* type of pbuf to build */
    register caddr_t bufaddr, /* data buffer address - system */
    register unsigned cnt, /* length of buffer */
    register struct xmem *xmem, /* ptr to cross memory descriptor */
    register void (*sched)()); /* ptr to function ret void */

extern int hd_extend(
    register struct sa_ext *saext); /* ptr to structure with extend info */

extern void hd_reduce(
    struct sa_red *sared, /* ptr to structure with reduce info */
    struct volgrp *vg); /* ptr to volume group structure */

/* hd_bbdir.c */

extern void hd_upd_bbdir(
    register struct pbuf *pb); /* physical request to process */

extern void hd_bbdirend(
    register struct pbuf *vgpb); /* ptr to VG bb_pbuf */

extern void hd_bbdirop(void);

extern int hd_bbadd(
    register struct pbuf *vgpb); /* ptr to VG bb_pbuf */

extern int hd_bbdel(
    register struct pbuf *vgpb); /* ptr to VG bb_pbuf */

extern int hd_bbupd(
    register struct pbuf *vgpb); /* ptr to VG bb_pbuf */

extern void hd_chk_bbhld( void ); 

extern void hd_bbdirdone(
    register struct pbuf *origpb); /* physical request to process */

extern void hd_logerr(
    register unsigned id, /* original request to process */
    register ulong dev, /* device number */
    register ulong arg1,
    register ulong arg2);

#else 

/* See above for description of call arguments */

/* hd_mircach.c */

extern int hd_ca_ckcach();
extern void hd_ca_use();
extern struct ca_mwc_mp *hd_ca_new();
extern void hd_ca_wrt();
extern void hd_ca_wend();
extern void hd_ca_sked();
extern struct ca_mwc_mp *hd_ca_fnd();
extern void hd_ca_clnup();
extern void hd_ca_qunlk();
extern int hd_ca_pvque();
extern void hd_ca_end();
extern void hd_ca_term();
extern void hd_ca_mvhld();

/* hd_dump.c */

extern int hd_dump();
extern int hd_dmpxlate();

/* hd_top.c */

extern int hd_open();
extern int hd_allocpbuf();
extern void hd_pbufdmpq();
extern void hd_openbkout();
extern void hd_backout();
extern int hd_close();
extern int hd_vgcleanup();
extern void hd_frefrebb();
extern int hd_allocbblk();
extern int hd_read();
extern int hd_write();
extern int hd_mincnt();
extern int hd_ioctl();
extern struct mwc_rec *hd_alloca();
extern void hd_dealloca();
extern void hd_nodumpvg();

/* hd_phys.c */

extern void hd_begin();
extern void hd_end();
extern void hd_resume();
extern void hd_ready();
extern void hd_start();
extern void hd_gettime();



/* hd_bbrel.c */

extern int hd_chkblk();
extern void hd_bbend();
extern void hd_baddone();
extern void hd_badblk();
extern void hd_swreloc();
extern daddr_t hd_assignalt();
extern struct bad_blk *hd_fndbbrel();
extern void hd_nqbblk();
extern void hd_dqbblk();

/* hd_sched.c */

extern void hd_schedule();
extern int hd_avoid();
extern void hd_resyncpp();
extern void hd_freshpp();
extern void hd_mirread();
extern void hd_fixup();
extern void hd_stalepp();
extern void hd_staleppe();
extern void hd_xlate();
extern int hd_regular();
extern void hd_finished();
extern int hd_sequential();
extern int hd_seqnext();
extern void hd_seqwrite();
extern int hd_parallel();
extern void hd_freeall();
extern void hd_append();
extern void hd_nearby();
extern void hd_parwrite();

/* hd_strat.c */

extern void hd_strategy();
extern void hd_initiate();
extern struct buf *hd_reject();
extern void hd_quiescevg();

extern int hd_bbupd();
extern void hd_chk_bbhld();
extern void hd_bbdirdone();
extern void hd_logerr();

#endif /* _NO_PROTO */

#endif /* _KERNEL */

#endif /* _H_HD */